Fine-Tuning Dataset Preparation: How Many Examples and What Quality Do You Really Need

Q: What if I have too little domain data and cannot supplement it synthetically?

In that case, consider [RAG instead of fine-tuning](/en/blog/rag-vs-fine-tuning-rozhodovanie) — RAG (Retrieval-Augmented Generation) requires no training data and works well for scenarios where you need access to knowledge rather than a change in response style or format. Fine-tuning is more appropriate when you are changing the model's behaviour, not its knowledge.

When a company first approaches fine-tuning its own language model, the focus is usually on the technical side: which model, which framework, which GPU. The dataset gets addressed later — and that is precisely where problems most often originate. We have seen projects where several days of GPU time were invested in training, only for the resulting model to end up shelved. Not because the method was wrong, but because the data was not prepared with the care it deserved.

This article focuses specifically on dataset preparation: how many examples you actually need, how to measure quality, what to do about duplicates, how to set up your training and evaluation split, and why something called a data-sufficiency gate exists — and why you should use it before you ever start training.

Why the dataset matters more than the method

Fine-tuning is, at its core, simple: show the model how to respond in a specific context, enough times for it to internalize the pattern. The problem arises when what you are showing it is not what you actually want — or when you show it too few times, or far too many times the same thing.

Research on language model training has repeatedly shown that quality outweighs quantity. A set of 1,000 carefully prepared, diverse, and correct examples can produce a better model than 50,000 examples assembled hastily with systemic errors. This is not intuitive — most technical teams will instinctively reach for more volume.

Worse than insufficient data, however, is poor quality at sufficient volume. A model trained on incorrect, biased, or contradictory examples will fix those as truth. And because fine-tuning increases the model's confidence in the patterns it has learned, the result is a model that answers confidently on questions where it should hesitate or say "I don't know." In domains such as law, medicine, or financial advice, this is a serious problem.

Rough minimums: SFT, DPO, and GRPO

Not every fine-tuning approach is the same. The three main methods have different data requirements:

SFT (Supervised Fine-Tuning) is the foundational method: the model receives input–output pairs and learns to replicate them. Functional results are achievable from roughly 1,000 high-quality examples, but production systems typically work with 10,000 to 100,000 pairs. What matters is that they cover the main scenarios of the target domain — not just the most frequent ones.

DPO (Direct Preference Optimization) requires preference pairs: for each input you have a "better" and a "worse" response. The model learns what you prefer. This requires either annotation by real humans or reliable automated evaluation. The recommended minimum is roughly 2,000 preference pairs with human-verified outcomes. Below this threshold there is a real risk that the model learns artifacts of the annotation process rather than genuine preferences.

GRPO and verifiable rewards are appropriate for tasks with objectively correct answers — mathematics, code, logic, or formats with a defined schema. Here the minimum is roughly 1,000 scored trajectories, but the critical prerequisite is that the reward is genuinely objective and automatically verifiable. If you define the reward manually and subjectively, you will encounter the same problems as with a poor DPO dataset.

These numbers are minimums for basic functionality, not quality guarantees. For production deployment in regulated sectors (law, medicine, pharmacy) a stricter standard applies: complete coverage of all target scenarios and jurisdictions, not just a representative sample.

What a quality example looks like

Before discussing volume, it is worth defining what you actually want from an individual example.

A quality SFT example has the following properties:

Correctness: the output is factually accurate and consistent with the context of the input
Consistency: the same input (or an equivalent formulation) yields the same category of response
Representativeness: the example covers a real scenario, not just an artificial test case
Clarity: the model can unambiguously understand from the example what is expected of it — no ambiguity
Appropriate length: not unnecessarily short (hollow patterns) and not unnecessarily long (the model loses the thread)

Typical problems we see in practice:

Outputs copied from existing documents without editing — carrying errors from the original source
Examples generated by an LLM without human review — the model learns another model's hallucinations
Inconsistent formatting — JSON in one case, Markdown in another, free text in a third, all for the same task type
Overlapping inputs with differing outputs — the model receives contradictory signals

Dataset format and file structure

Most modern frameworks (Unsloth, Axolotl, LLaMA-Factory, TRL) accept standard formats. The most commonly used are:

Alpaca format for instruction tasks: each example has instruction, input (optional), and output fields. Simple and well-supported.

ShareGPT / conversational format for chat models: examples are conversations with a list of messages, each carrying a role (system, user, assistant). Better suited for multi-turn scenarios.

JSONL (one JSON object per line) is the preferred file format for most tools — it allows streaming of large datasets without loading the entire file into memory.

When preparing a DPO dataset, one additional field is added: typically chosen and rejected (or the equivalent naming in your framework) for each input prompt.

Deduplication — the underestimated step

Duplicate or near-duplicate examples are among the most common problems in datasets assembled automatically or from corporate documentation. The effect is twofold: the model disproportionately learns the patterns contained in the duplicates (overfitting on a data subset), and evaluation is skewed if duplicates end up in both the training and test sets.

Basic deduplication works on exact match (hashing the input text). More advanced approaches use MinHash or embedding similarity to detect semantic duplicates — examples that are phrased differently but are content-equivalent.

For domain datasets we recommend at least the following steps:

1.Exact deduplication based on output hash
2.Checking that different formulations of the same input lead to consistent outputs
3.Removing examples where both the input and output are shorter than a meaningful lower bound (for example, excessively short answers)

Tools such as the datasets library from Hugging Face or datasketch (a MinHash implementation) cover these steps without the need to write custom code.

Train/eval split: numbers and logic

Splitting the dataset into a training set and an evaluation set is fundamental, yet unnecessary mistakes are made in practice.

The standard 80% training / 20% evaluation split works, but for small datasets (below roughly 5,000 examples) it is better to use 90/10 and supplement the evaluation set with repeated cross-validation or a separate held-out test set.

The cardinal rule: no example from the evaluation set may appear in the training set — not even as a semantic duplicate. If you are doing deduplication, do it before the split, not after.

For domain fine-tuning datasets we recommend constructing the evaluation set to include: - A representative sample of the main scenarios (same distribution as training) - Several deliberate edge cases and boundary situations - Examples where the correct answer is "I don't know" or "insufficient information" — if that is what you expect from the model

The evaluation set serves two purposes: measuring performance during training (validation loss) and independent assessment after training. For production decisions the second function is more valuable — which is why the evaluation set should be human-verified and not automatically generated.

Related article on measuring results after training: How to measure whether fine-tuning helped.

Synthetic data: benefits and risks

For most domain projects, the volume of existing human-written data is insufficient. The solution is to augment the dataset with synthetic data — data generated by a stronger (frontier) model on the basis of human seed examples.

A typical recipe: 150–200 human-written, verified seed examples → generate 10 to 100× more via a teacher model (Claude Opus, GPT-4o, or a similar frontier model) → human review of a sample (at least 10–20%) → quality filtering.

Risks of synthetic data:

Model collapse: if the training dataset consists entirely of synthetically generated data from a single teacher model, the fine-tuned model copies its weaknesses and quirks. The long tail of real-world scenarios remains uncovered.
Teacher model hallucinations: the teacher model is not infallible — it generates factual errors that, without human review, flow straight into training.
Style bias: a strong teacher model has a distinctive response style. If that style is not what you want for your use case, you need to explicitly correct for it in the prompt and during review.

Working with synthetic data requires more quality-control discipline, not less. More on this topic: Synthetic data for fine-tuning.

Data-sufficiency gate: don't start training before the data is ready

One mistake we see repeatedly is launching training with a dataset you already know is incomplete — on the assumption that you will "fill in the gaps later." The problem is that incomplete fine-tuning can be actively harmful.

A model trained on a dataset that covers only part of a domain will learn to answer with high confidence even on questions from the uncovered part. It has no mechanism for recognising that it "doesn't know" something — it only knows what it was taught. The result is worse than the base model, which at least knows it is not specialised for the domain.

Before starting training, we recommend verifying:

1.Coverage of main scenarios: do you have examples for every key task type the model will perform?
2.Minimum volume: do you meet the rough minimums for the chosen method (SFT, DPO, GRPO)?
3.Quality evaluation set: do you have a human-verified evaluation set that is separate from the training data?
4.For regulated sectors: do you cover all target jurisdictions and scenarios — not just a representative sample?

This checklist is not bureaucracy — it is insurance against a training investment that ends in a problematic model. More on choosing between methods: SFT, DPO, GRPO — which method and when.

Catastrophic forgetting: what fine-tuning can break

Fine-tuning on a narrow dataset has a side effect: the model may partially forget capabilities it had before training. This phenomenon — known as catastrophic forgetting — is well-documented in research and real in practice.

LoRA and QLoRA mitigate this problem because the original model weights remain frozen and the adapters are relatively small. But even PEFT methods do not eliminate it entirely — with overly aggressive training (high learning rate, large dataset with a narrow distribution) degradation of general capabilities will show up.

Practical implications:

Test not only on the domain evaluation set, but also on general benchmarks (at least informatively)
If the model fails on tasks it handled correctly before fine-tuning, that is a signal to adjust the training distribution or hyperparameters
For production deployment, always compare the fine-tuned model against the base model on the same task set — not just on the domain set

Pre-training checklist

Before starting training, run through this list:

1.Dataset is in a standard format (Alpaca or ShareGPT JSONL)
2.Exact deduplication has been performed on the entire dataset
3.Train/eval split is complete before any other processing
4.The evaluation set contains no semantic duplicates from the training set
5.At least 10% of the dataset has passed human quality review
6.Scenario coverage has been verified — no key categories are empty
7.Synthetically generated data is labelled and its proportion in the dataset is a deliberate choice
8.For DPO: every preference pair has a defined reason why "chosen" is better than "rejected"

This list is not exhaustive, but it covers the most common sources of problems we see with clients approaching their first fine-tuning project.

Frequently asked questions

What is the minimum number of examples I need for SFT?

Functional results are achievable from roughly 1,000 high-quality examples — this figure comes from research showing that carefully selected examples can outperform a dataset that is orders of magnitude larger but of lower quality. For production systems we recommend at least 10,000 examples with verified coverage of key scenarios. Regulated sectors apply stricter criteria.

Can I generate the entire dataset with an LLM?

A synthetically generated dataset can make up the majority of the volume, but not the entire dataset. You need human-written and verified seed examples (typically 150–200 as a minimum) and human review of a sample of synthetically generated examples. A model trained exclusively on the output of a single teacher LLM copies its errors and weaknesses without correction.

How do I split the dataset into training and test sets?

The standard split is 80% training / 20% evaluation; for small datasets under 5,000 examples, prefer 90/10. The key rule: perform deduplication before the split, not after. The evaluation set must not contain even semantic duplicates from the training set — otherwise you are measuring "ability to memorise," not "ability to generalise."

What if I have too little domain data and cannot supplement it synthetically?

In that case, consider RAG instead of fine-tuning — RAG (Retrieval-Augmented Generation) requires no training data and works well for scenarios where you need access to knowledge rather than a change in response style or format. Fine-tuning is more appropriate when you are changing the model's behaviour, not its knowledge.

Why does the model respond worse after fine-tuning than before?

The most common causes: poor dataset quality (errors in examples, inconsistent formatting), insufficient scenario coverage (the model "learned" only part of the domain and extrapolates incorrectly to the rest), or catastrophic forgetting from overly aggressive training. A closer look: 7 reasons why fine-tuning fails.

*Dataset preparation is where the success of an entire fine-tuning project is decided — not at the point of choosing a method or GPU. If you are preparing for your first fine-tuning run or are unsatisfied with the results of a previous attempt, MP Industrial Solutions is glad to help you assess the quality and structure of your data before training begins.*