Synthetic data for fine-tuning: when it helps and when it poisons your model

Every team that starts fine-tuning hits the same wall: real, well-annotated examples are scarce. Producing new examples by hand is expensive and slow. The question almost inevitably surfaces — what if we generated the data with a model?

It is a legitimate technique. Research teams use it, production systems use it. But it has precise conditions under which it works, and equally precise conditions under which it will quietly wreck the model you are tuning. This article breaks down both — without unnecessary optimism.

What synthetic data actually is (and is not)

Synthetic training data for fine-tuning consists of input–output examples generated automatically, not captured from real human behaviour. In practice this means one of three things:

Generation via a teacher model — a stronger model (e.g. a frontier API) receives an instruction and generates examples for a weaker target model. This is sometimes loosely called distillation, though it is not distillation in the original sense.
Augmentation of existing data — existing examples are paraphrased, reformatted, or expanded; semantic content is preserved while form changes.
Self-play and synthetic scenarios — a model generates data for itself (or plays both teacher and student roles), typically for reasoning or conversational fine-tuning.

Important: synthetic data is not a substitute for continued pretraining on raw domain text. Continued pretraining builds a knowledge base through unlabelled text. Synthetic data for SFT (supervised fine-tuning) teaches the model format and behaviour, not knowledge. These two layers complement each other but do not replace each other.

When synthetic data genuinely helps

Not every use case has enough real data. These are the situations where synthetic data delivers real value:

1. You have a strong seed set, but it is small. Published results on small, curated instruction sets suggest that a model trained on about a thousand high-quality examples can match or outperform one trained on a hundred thousand average ones. If you have 150–200 real, carefully curated examples, you can expand them 10–50× with a teacher model while preserving the distribution you wanted. This works well for structured tasks with verifiable outputs: entity extraction, classification, format transformation.

2. You are covering the long tail. Real data has a distribution — some cases are common, some rare. A model trained only on real data may struggle with edge cases that rarely appeared in historical data. A teacher model can deliberately target these edge cases.

3. You want to transfer reasoning from a larger model. This is the core principle of the distillation approach popularised by DeepSeek — a chain-of-thought from a frontier model is used as a training signal for a smaller model. The smaller model does not learn to "know" the same things, but it learns to *reason* similarly. The results are documented: 7B–8B models trained on chain-of-thought synthetic datasets can outperform generalist models several times their size on narrow reasoning tasks.

4. You need data augmentation for safety edge cases. Red-teaming and generating adversarial examples — where you want to show the model what it *should not* do — is another legitimate use of synthetic data. Real failure examples are rare; a synthetic teacher can generate them systematically.
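For point 2, one way to decide where to aim the teacher is to count how often each case type appears in the seed set and request extra examples for the rare ones. The sketch below is a minimal illustration: the `category` label on each seed example, the category names, and the `min_per_category` target are all assumptions made up for this example.

```python
from collections import Counter


def long_tail_targets(seed_examples, min_per_category=20):
    """Return how many extra examples to request per under-represented category.

    `category` is a hypothetical label on each seed example, and
    `min_per_category` is an arbitrary target chosen for illustration.
    """
    counts = Counter(example["category"] for example in seed_examples)
    return {
        category: min_per_category - count
        for category, count in counts.items()
        if count < min_per_category
    }


# Toy seed set: one common case, one moderately rare, one very rare.
seed = (
    [{"category": "standard_invoice"}] * 120
    + [{"category": "credit_note"}] * 25
    + [{"category": "handwritten_scan"}] * 5
)
targets = long_tail_targets(seed, min_per_category=20)
print(targets)  # {'handwritten_scan': 15}
```

Categories that never occur in the seed set will not show up in this count; those have to be listed by hand.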

See also: Fine-tuning dataset — how much and what quality for quantitative recommendations on dataset size.

The main risks: when it poisons the model

Synthetic data carries three categories of risk, each of which can silently degrade your model.

Risk 1: Propagation of the teacher's errors

The teacher model is not infallible. It has its own hallucination patterns, blind spots, and phrasing preferences. When it generates a thousand examples and you train your target model on them, the target model does not just learn the desired distribution; it also learns the teacher's quirks. In small doses this is tolerable. With large synthetic datasets and no filtering, it produces a model that confidently repeats errors that are hard to spot, because they are a model's systematic errors rather than the kind of mistakes a human annotator would make.

A real-world example: a technical documentation client had a teacher model that consistently used an old trade name for one type of electrical component. A thousand generated examples later, the target model was subtly but consistently biased toward that same outdated nomenclature — even though no such pattern existed in the seed data.

Risk 2: Model collapse

This is technically the most serious risk and an active area of research. Model collapse occurs when a model trained on synthetic data from the same model (or similar models) progressively loses variability and converges on a narrow output distribution. Outputs are fluent and formally correct — but the model has stopped covering the range of real inputs.

The intuition: if the training data is sampled from the output distribution of the same model (or its predecessor), each training iteration amplifies central patterns and weakens the edges. After several cycles the model handles average inputs well but can no longer process unusual phrasings, edge cases, or data outside the training distribution.
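A deliberately crude toy can make this intuition concrete. It is not a simulation of real training: it simply assumes that every generation of training on model outputs loses the most extreme 5 % of cases on each side of a one-dimensional input distribution. The function name and the 5 % figure are assumptions for illustration only.

```python
import statistics


def next_generation(samples, drop_percent_per_tail=5):
    """Return the samples with the most extreme values on each side removed."""
    ordered = sorted(samples)
    drop = len(ordered) * drop_percent_per_tail // 100
    return ordered[drop:len(ordered) - drop]


# Generation 0: evenly spread "real" inputs between -3 and 3.
samples = [i / 100 for i in range(-300, 301)]
spreads = [statistics.pstdev(samples)]

# Each loop stands in for one cycle of training on the previous model's outputs.
for _ in range(5):
    samples = next_generation(samples)
    spreads.append(statistics.pstdev(samples))

for generation, spread in enumerate(spreads):
    print(generation, round(spread, 2))
```

The spread shrinks with every generation while the centre stays where it was, which is the pattern described above: average inputs remain covered, the edges disappear.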

In production systems this manifests as: the model "works" in tests (tests cover common cases), but in production clients complain that they sometimes receive a generic or nonsensical answer — precisely on the edge questions.

Protection: never train exclusively on synthetic data. Human seed data must constitute at least ~20–30 % of the dataset and must cover the diversity of inputs — not just the average cases. Systematic evaluation on out-of-distribution inputs before deployment is mandatory.
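The floor on real data translates directly into a cap on how many synthetic examples you can add. A small worked calculation, with a hypothetical helper name and an integer percentage to avoid floating-point rounding:

```python
def max_synthetic_examples(n_real, min_real_percent):
    """Largest synthetic count that keeps real data at or above the given share.

    Derived from n_real / (n_real + n_synth) >= min_real_percent / 100.
    """
    if not 0 < min_real_percent <= 100:
        raise ValueError("min_real_percent must be in (0, 100]")
    return n_real * (100 - min_real_percent) // min_real_percent


for floor in (20, 30):
    print(floor, max_synthetic_examples(200, floor))
# 20 800
# 30 466
```

With 200 real examples, a 20 % floor allows at most 800 synthetic examples (4× the real data) and a 30 % floor allows 466.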

Risk 3: Licence and ToS restrictions

This risk is less technical but critical for B2B use. Most frontier model providers (the vendors behind Claude, GPT, and Gemini) include explicit restrictions in their terms of service on using model outputs to train competing models. Exact wording varies and changes; always read the current ToS of the specific provider.

Practically: if you use a commercial API as a teacher model and plan to commercially distribute or deploy the target model for customers, you need a clear legal basis. The situation for internal deployment on your own infrastructure is different, but not automatically clean.

The safe path: open-weight models (Qwen, Mistral, and others with Apache 2.0 or MIT licences) typically permit synthetic data generation — but every model has its own terms; always verify before deployment. For a commercially clean synthetic pipeline with no legal question marks, both teacher and student models should come from families with permissive licences.

Generation via teacher model — practical process

The process below assumes you have 100–200 quality seed examples and want to expand them.

1. The seed set is the foundation: do not cut it short. Your seed examples must cover the distribution you want. If the seed set covers only one third of the use-case space, the synthetically expanded dataset will cover that same third, just at a larger scale.

2. Prompt engineering for the teacher. The teacher must receive explicit instructions about format, style, domain, and what you want to *prevent*. A vague prompt produces vague data. A good teacher prompt includes: sample input–output pairs from the seed set, the required output format, domain terminology you want to prefer, and negative examples (what to avoid).

3. Generate more than you need, then filter. Generate 3–5× more examples than you plan to use. Then filter:
- Automated format check (correct JSON, correct structure)
- Embedding-based deduplication (overly similar examples add nothing)
- Relevance scoring, either via another model as judge or via rule-based checkers if you have verifiable outputs
- Spot human review of at least 5–10 % of generated examples

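4. Mix with real data. The final dataset should contain all of the seed data plus the filtered synthetic examples. Watch the ratio: a 10–50× expansion leaves real data at only about 2–9 % of the total, so the larger multipliers are defensible only with strong filtering and human review. Keep a source identifier in the dataset metadata; you will appreciate it when debugging.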

5. Evaluate on a holdout set of real data. This is critical. The eval set must not contain synthetic examples. If you do not evaluate the model against real human judgement, you will never know whether synthetic data introduced drift.
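The format check from step 3 and the source tagging from step 4 can be sketched in a few lines. This is a minimal illustration, not a full pipeline: it assumes the teacher returns one JSON object per example with hypothetical `input` and `output` string fields, and the sample records are invented.

```python
import json

REQUIRED_FIELDS = ("input", "output")  # hypothetical schema for this sketch


def parse_candidate(raw_text):
    """Return the parsed record if it is valid JSON with non-empty string fields, else None."""
    try:
        record = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict):
        return None
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            return None
    return record


def build_training_set(seed_records, raw_teacher_outputs):
    """Combine all seed records with format-valid synthetic ones, tagging each with its source."""
    dataset = [{**record, "source": "seed"} for record in seed_records]
    for raw_text in raw_teacher_outputs:
        record = parse_candidate(raw_text)
        if record is not None:
            dataset.append({**record, "source": "synthetic"})
    return dataset


# Toy data: one real example and three raw teacher outputs, two of them malformed.
seed_records = [{"input": "Order arrived damaged", "output": "complaint"}]
raw_teacher_outputs = [
    '{"input": "The parcel was crushed on delivery", "output": "complaint"}',
    '{"input": "Missing output field"}',
    "not json at all",
]
dataset = build_training_set(seed_records, raw_teacher_outputs)
print([row["source"] for row in dataset])  # ['seed', 'synthetic']
```

Because every row carries a `source` value, building the real-only evaluation set from step 5 is a matter of filtering on that field.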

For more on evaluation see How to measure whether fine-tuning helped.

Synthetic data vs model distillation — an important distinction

These terms are often used interchangeably in practice, but they are not the same thing.

Model distillation in the original sense trains a smaller model to mimic the output distribution of a larger one. That involves comparing distributions via KL divergence, access to the teacher's logits, and the full spectrum of knowledge distillation techniques from academic literature.

Generating synthetic data from a teacher model is a more pragmatic approach: the teacher model generates text input–output examples, which are used as a standard SFT dataset. You are not using the teacher's logits, you are not computing distributional similarities — you are just generating examples. The result is weaker than full distillation, but achievable without access to the model's internals and without specialised frameworks.
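The difference in training signal can be shown on a single token position. The sketch below uses invented logits over a three-token vocabulary and plain Python rather than a training framework. For simplicity it treats the teacher's most likely token as the one it emitted.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, optionally softened by a temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def cross_entropy(target_index, q):
    """Negative log-likelihood of one observed token under distribution q."""
    return -math.log(q[target_index])


teacher_logits = [2.0, 1.5, 0.1]  # toy values, not from any real model
student_logits = [1.0, 1.0, 1.0]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

# Logit-based distillation: the student is pulled toward the full teacher distribution.
distill_loss = kl_divergence(teacher_probs, student_probs)

# SFT on generated text: the student only sees the single token the teacher emitted.
emitted_token = max(range(len(teacher_probs)), key=teacher_probs.__getitem__)
sft_loss = cross_entropy(emitted_token, student_probs)

print(round(distill_loss, 4), round(sft_loss, 4))
```

In the first loss the student learns that the second token was nearly as plausible as the first. In the second loss that information is gone, which is one way to see why text-only transfer carries a weaker signal.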

In practice, most "distillation" in commercial projects happens through this second approach, because access to a frontier model's logits is not available via standard APIs. The results are nonetheless demonstrable: see the DeepSeek-R1 distilled models, which transferred reasoning capabilities to models ranging from 1.5B to 70B parameters through synthetic chain-of-thought data.

For a deeper look at distillation as a technique: Model distillation.

Augmentation vs generation — which to use when

Augmenting existing examples (reformatting, paraphrasing, changing style) is a safer approach than pure generation — it preserves facts from the seed set and only changes form. It is appropriate when:

Your seed data is factually reliable (e.g. technical documentation, your internal processes)
You want to teach the model to respond to different phrasings of the same question
You have no reason to introduce new facts outside the seed set

Pure generation (the teacher model creates entirely new examples) is more powerful but riskier — the teacher can introduce facts not present in the seed set, and you may not catch this without human review.

A combined approach: augmentation for ~60 % of the synthetic dataset, pure generation for ~40 % (to cover long-tail scenarios) — with a higher rate of human review on the generated examples.

When not to use synthetic data

There are situations where synthetic data will not just fail to help, but will actively cause harm:

Facts and precise numerical values. If fine-tuning is meant to teach the model specific product numbers, prices, or technical parameters — the teacher model will invent them. This is a classic hallucination environment. For factual knowledge the correct technique is RAG or continued pretraining on verified texts, not SFT on synthetics.

Regulated domains without expert validation. In legal, medical, or financial contexts, synthetically generated examples can contain factual errors that a real expert would spot in seconds, but which the trained model will replicate with full confidence. If you do not have expert review of every generated example, do not use synthetic data here.

When you have no seed data at all. Synthetic data without a seed dataset is generation from nothing — you get a distribution that reflects the teacher, not your domain. Before generating you must have at least a small, real, well-annotated foundation.

Time-sensitive information. The teacher model has a knowledge cutoff. Synthetic examples about current events, the latest legislation, or current market conditions will be outdated, and you will not know it unless you build a systematic fact-check pipeline.

Filtering and quality gates — concrete steps

Filtering decides whether a synthetic dataset helps or hurts. A minimum quality gate:

1. Format validation: automated, 100 % of examples. Exclude examples with incorrect format, missing fields, or invalid values.
2. Deduplication: embedding-based similarity search; examples with cosine similarity > 0.92 against existing examples should be removed (or one representative kept).
3. Relevance scoring: if you have verifiable outputs (code, JSON, SQL), run a syntax check. If not, use model-as-judge with an explicit rubric, not a generic "is this good?" prompt.
4. Distribution analysis: compare the spread of topics, lengths, and formats in the synthetic dataset vs the seed set. Significant deviations signal drift.
5. Spot human review: minimum 5 % of examples with a rotating criterion (do not always evaluate the same types). Focus on: facts, tone, edge cases.
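The deduplication gate (step 2) can be sketched as a greedy pass over precomputed embeddings. This is a minimal illustration in plain Python: the three-dimensional vectors are invented stand-ins for real embeddings, all vectors are assumed to be non-zero, and the pairwise loop is quadratic, so a large dataset would need an approximate nearest-neighbour index instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, threshold=0.92):
    """Greedy near-duplicate removal.

    Keeps an item only if its similarity to every already-kept item is at or
    below the threshold. Returns the indices of the kept items.
    """
    kept = []
    for index, vector in enumerate(embeddings):
        if all(cosine_similarity(vector, embeddings[k]) <= threshold for k in kept):
            kept.append(index)
    return kept


embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],  # near-duplicate of the first vector
    [0.0, 1.0, 0.0],
]
print(deduplicate(embeddings))  # [0, 2]
```

Listing the seed embeddings first means a synthetic example that merely restates a seed example is the one that gets dropped.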

For more context on why data quality matters more than quantity: 7 reasons fine-tuning fails.

Frequently asked questions

How many synthetic examples can I add to real data without risk?

There is no fixed ratio that holds for every case. A practical reference point: synthetic examples should not make up more than 70–80 % of the total dataset unless you have strong filtering and human review in place. Above that proportion the risk of model collapse grows. Seed data must always be present and must cover the full distribution of the use-case space — not just the common cases.

Can I use ChatGPT / Claude to generate training data for my model?

It depends on the use. The situation for internal enterprise deployment (the model runs on your infrastructure and is not commercially distributed) differs from a commercial product. Always read the current ToS of the specific provider, because the wording changes. For a commercially clean pipeline we recommend open-weight teacher models (Llama, Qwen, Mistral), but check the licence of the exact release: not every model in these families ships under Apache 2.0 or MIT, and Llama in particular uses its own community licence with its own conditions.

Is generation via a teacher model the same as model distillation?

No. Distillation in the original sense works with the teacher's logits (probability distributions). Generating synthetic data through a teacher API is a more pragmatic variant — you get text examples, not a distributional signal. Results are weaker than full distillation, but achievable without access to the model's internals. In commercial projects this variant is more common precisely because of its accessibility.

What if the teacher model generates factually incorrect examples?

This is a standard problem and the main argument for spot human review. A teacher model hallucinates less often than a small model, but it still hallucinates. The solution: verifiable tasks (code, JSON, SQL) can be checked automatically; facts in unstructured text require human review. If you do not have capacity for human review, restrict synthetic data to augmenting existing verified examples and do not generate new facts.

Will synthetic data help if the model knows nothing about my domain?

Rarely. Synthetic data can expand and diversify an existing seed set — it cannot replace a foundation of domain knowledge. If the model has no domain base at all, the correct path is continued pretraining on domain texts (manuals, standards, internal documents), and only then SFT — synthetic or real.

*MP Industrial Solutions makes these decisions daily for clients in manufacturing, energy, and logistics. If you are working out what combination of real and synthetic data makes sense for your specific model and use case, we are happy to work through it together.*