7 Reasons Fine-Tuning Fails in Practice

Fine-tuning is one of the most repeated phrases in every AI roadmap document. "We'll train our own model that understands our domain." Elegant in theory. In practice, we've watched the same project unravel at seven different points — and for a different reason each time. This article is not about how to do fine-tuning — it's about why most attempts never reach production, and how to sidestep these pitfalls before they consume your time and budget.

If you're still weighing whether to reach for fine-tuning at all or to go with RAG instead, read RAG vs fine-tuning — making the decision first. The failure modes described below often start exactly there.

---

1. Too little data — or low-quality data

This is the most common cause of failure. The team collects whatever is available — a few hundred internal documents, a CRM export, some email archives — and kicks off training. The result is a model that answers incorrectly with high confidence: worse than the base model, which at least knows how to say "I don't know."

Research has shown (LIMA and related work) that a relatively small number of high-quality examples — on the order of a thousand — can produce a better model than tens of thousands of low-quality ones. Quantity without quality is actively harmful here.

A practical reference scale:

- SFT (supervised fine-tuning): minimum ~1,000 high-quality question–answer or task-completion pairs, covering the main topics of your domain. Production systems typically work with 10,000–100,000 examples.
- DPO (preference tuning): ~2,000 preference pairs, each with a human-verified preferred and rejected response.
- GRPO (RL-based tuning): ~1,000 prompts whose outputs can be scored automatically with verifiable rewards; the trajectories themselves are sampled and scored during training.

Coverage matters as much as count. If your training dataset lacks sufficient examples for a particular sub-domain, the model will hallucinate in that area. We therefore recommend a topic coverage audit before training begins: list the main question categories the model will need to handle and verify you have adequate representation in each.
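A coverage audit can be as simple as counting examples per category. The sketch below assumes each example already carries a `category` label; the category names, the field names, and the threshold of 50 examples per category are illustrative assumptions, not recommendations from any particular study.

```python
from collections import Counter


def coverage_audit(examples, required_categories, min_per_category=50):
    """Return the categories whose example count is below the threshold."""
    counts = Counter(ex["category"] for ex in examples)
    return {
        cat: counts.get(cat, 0)
        for cat in required_categories
        if counts.get(cat, 0) < min_per_category
    }


# Hypothetical toy dataset: 120 billing examples, 30 installation examples,
# and no warranty examples at all.
examples = (
    [{"category": "billing", "question": "q", "answer": "a"}] * 120
    + [{"category": "installation", "question": "q", "answer": "a"}] * 30
)

gaps = coverage_audit(examples, ["billing", "installation", "warranty"])
print(gaps)  # {'installation': 30, 'warranty': 0}
```

Categories that show up in `gaps` are the ones to fill with more data, or to exclude from the model's intended scope, before training starts.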

---

2. Overfitting — the model only knows what it saw

When the dataset is too small or too narrow and the model is trained on it for too long, overfitting occurs. The model excels on training examples, but fails completely — or hallucinates — at the slightest deviation: a slightly differently worded question, a different context, an unusual input.

Signs of overfitting in practice:

- The model literally quotes training samples instead of generalising.
- High scores on training data, noticeably lower scores on new examples.
- The model refuses to answer out-of-distribution questions rather than acknowledging uncertainty.

Technical countermeasures: monitor validation loss during training and stop when it starts rising (early stopping). Track eval metrics on a hold-out set, not just on training data. For small datasets, stronger regularisation and a lower LoRA adapter rank (e.g. rank=8 instead of rank=64) are the more appropriate configuration.
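The early-stopping rule can be sketched in a few lines of plain Python. This is a minimal illustration that works on a list of recorded validation losses; the `patience` and `min_delta` parameters and the loss values are assumptions chosen for the example, and most training frameworks ship their own equivalent.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return the index of the best epoch, stopping once validation loss
    has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best checkpoint: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss keeps rising: stop training
    return best_epoch


# Made-up validation losses: improving at first, then rising as overfitting sets in.
history = [1.92, 1.41, 1.18, 1.11, 1.15, 1.23, 1.37]
best = early_stopping_epoch(history, patience=2)
print(best)  # 3
```

The checkpoint saved at the returned epoch is the one to evaluate and deploy, not the last one written to disk.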

---

3. Catastrophic forgetting — the model forgets what it knew

Fine-tuning on narrow domain data can degrade the model's general capabilities: its ability to reason, summarise, work in English or other languages, and apply basic logic. This phenomenon is called catastrophic forgetting and is well documented.

In practice it looks like this: by fine-tuning on internal technical documents you achieve excellent answers to domain-specific questions — but the model stops working on general tasks it handled without issue before. Teams unfamiliar with this phenomenon interpret it as "the model broke."

What mitigates — but does not eliminate — the problem:

- LoRA / QLoRA: adapters train only a small number of additional parameters while the original weights remain frozen. This is the most effective practical way to limit forgetting.
- Merging: the fine-tuned model is merged with the base model using tools such as mergekit (SLERP, TIES). The result balances domain knowledge and general capabilities.
- Dataset diversification: including general-purpose examples in the training dataset alongside domain-specific ones.
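Dataset diversification can be sketched as mixing a fixed share of general-purpose examples into the domain set. The 20 % share used below is an assumption for illustration only; the right ratio depends on the task and should be checked against your eval results.

```python
import random


def mix_datasets(domain, general, general_fraction=0.2, seed=0):
    """Combine all domain examples with enough general-purpose examples
    that the latter make up roughly `general_fraction` of the result."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_domain + n_general) = general_fraction for n_general.
    n_general = round(len(domain) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general))  # cannot sample more than we have
    mixed = list(domain) + rng.sample(general, n_general)
    rng.shuffle(mixed)
    return mixed


# Hypothetical placeholder examples; real entries would be prompt/response records.
domain = [f"domain-{i}" for i in range(800)]
general = [f"general-{i}" for i in range(5000)]

mixed = mix_datasets(domain, general, general_fraction=0.2)
share = sum(x.startswith("general") for x in mixed) / len(mixed)
print(len(mixed), round(share, 2))  # 1000 0.2
```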

For regulated industries (medicine, law, pharmaceuticals) forgetting is especially critical — a model that has lost its logic and safety patterns can produce outputs that appear substantive but are factually incorrect.

---

4. Wrong method choice — full fine-tuning where LoRA is enough, or vice versa

Teams new to fine-tuning tend to go to one extreme or the other: either full fine-tuning (which demands enormous GPU resources) or the most frugal option chosen without thinking through the trade-offs.

A rough VRAM requirement guide for a 7B model:

- Full fine-tuning (BF16): ~70–120 GB; requires a multi-GPU server
- LoRA (16-bit): ~15 GB; fits on an A100 or a single RTX 4090 (24 GB VRAM)
- QLoRA (4-bit): ~5 GB; fits on an RTX 3090 with 24 GB VRAM
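Figures of this order can be reproduced with back-of-the-envelope arithmetic. The sketch below counts only weights, gradients and optimizer state, and ignores activations and framework overhead, so real usage will be higher. The 16 bytes per trainable parameter (16-bit weight and gradient, 32-bit master copy, two 32-bit Adam moments) and the 40M-parameter adapter are assumptions for illustration.

```python
def training_memory_gb(n_frozen, bytes_per_frozen, n_trainable, bytes_per_trainable=16):
    """Rough memory for weights, gradients and optimizer state, in decimal GB.

    Activations, KV caches and framework overhead are NOT included.
    """
    return (n_frozen * bytes_per_frozen + n_trainable * bytes_per_trainable) / 1e9


N_PARAMS = 7e9   # a 7B model
ADAPTER = 40e6   # assumed LoRA adapter size; depends on rank and target layers

full = training_memory_gb(0, 0, N_PARAMS)            # every weight is trainable
lora = training_memory_gb(N_PARAMS, 2, ADAPTER)      # frozen 16-bit base weights
qlora = training_memory_gb(N_PARAMS, 0.5, ADAPTER)   # frozen 4-bit base weights

print(round(full, 1), round(lora, 1), round(qlora, 1))  # 112.0 14.6 4.1
```

Batch size, sequence length and gradient checkpointing all move the real number, which is why these requirements are better treated as ranges than as exact values.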

LoRA typically achieves ~90–95 % of full fine-tuning quality at a fraction of the memory cost: by the figures above, roughly 5–8× less for LoRA and 14–24× less for QLoRA. For most domain use-cases this is sufficient, and reaching for full fine-tuning without a clear justification is a waste of resources.

On the other hand, there are cases where LoRA is not enough: when the tokeniser changes, when the language has very different morphology from the training data, or for deep "continued pretraining" (building domain knowledge from unlabelled text). Method selection should be driven by the specific objective, not by what happened to run first. A more detailed comparison is in the article LoRA vs QLoRA vs full fine-tuning.

---

5. Fine-tuning instead of RAG: the wrong tool for the job

This is perhaps the most expensive mistake we encounter. A team wants the model to "know" about their products, documents, and internal processes. Fine-tuning seems like the natural answer. They run it, pour the data into training, and a few weeks later discover the model is still hallucinating facts — because fine-tuning does not reliably inject facts into a model.

Fine-tuning is the right tool for:

- Changing the format and style of responses (e.g. always returning JSON with a specific schema).
- Changing behaviour (e.g. the model should always refuse certain types of requests, or should maintain a specific communication tone).
- Adapting to a specialised domain where the base model has insufficient training data (e.g. narrow industrial jargon, proprietary formats).

RAG (Retrieval-Augmented Generation) is the right tool for:

- Accessing current or frequently changing information.
- Answering questions based on specific documents.
- Tracing an answer back to its source (citation, grounding).

In practice: if your goal is "the model should be able to answer questions from our product catalogue" — that is a RAG use-case, not a fine-tuning one. You would apply fine-tuning if you want the model to use your specific response format or industry terminology.

---

6. No eval — we don't know whether we won or lost

A surprisingly common scenario: the team runs fine-tuning, the model trains, it "feels better" on a handful of manually tested examples, and it goes to production. A month later, complaints arrive about regression — the model stops working on cases it previously handled without issue.

Without systematic evaluation (model quality assessment) we do not know:

1. Whether fine-tuning helped at all (comparison against the base model).
2. Whether we introduced a regression (performance on previously working cases).
3. Where exactly the model fails (in which question categories).

A minimal eval framework before production deployment includes:

- Hold-out test set: ~10–20 % of the data, excluded from both training and validation and used for evaluation only after training completes.
- Baseline comparison: the same questions put to both the base model and the fine-tuned version. A regression is any case the base model handled correctly on which the fine-tuned model scores lower.
- Task-specific metrics: not just perplexity (how well the model predicts held-out text), but metrics relevant to your use-case, such as extraction accuracy, format correctness, and quality ratings from a domain expert.
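The baseline comparison can be sketched as a per-category tally over the hold-out set. The record layout below (`category`, `base_ok`, `tuned_ok`) is a hypothetical format: it assumes each test case has already been graded as correct or incorrect for both models, by whatever task-specific metric you use.

```python
from collections import defaultdict


def regression_report(results):
    """Summarise base vs fine-tuned correctness per question category.

    `results` is an iterable of dicts with keys 'category', 'base_ok', 'tuned_ok'.
    """
    report = defaultdict(lambda: {"n": 0, "base": 0, "tuned": 0, "regressions": 0})
    for r in results:
        row = report[r["category"]]
        row["n"] += 1
        row["base"] += int(r["base_ok"])
        row["tuned"] += int(r["tuned_ok"])
        # A regression: the base model was right and the fine-tuned one is wrong.
        row["regressions"] += int(r["base_ok"] and not r["tuned_ok"])
    return dict(report)


# Made-up graded results for four hold-out cases.
results = [
    {"category": "extraction", "base_ok": False, "tuned_ok": True},
    {"category": "extraction", "base_ok": True, "tuned_ok": True},
    {"category": "general", "base_ok": True, "tuned_ok": False},
    {"category": "general", "base_ok": True, "tuned_ok": True},
]

report = regression_report(results)
for category, row in report.items():
    print(category, row)
```

In this toy data the fine-tuned model gains on `extraction` but loses a previously correct `general` case, which is the pattern an aggregate score alone would hide.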

More detail on setting up evaluation: How to measure whether fine-tuning helped.

---

7. Unrealistic expectations — fine-tuning is not a magic fix

The last — and in many ways the most important — cause. Fine-tuning is often sold in internal presentations as the solution that turns a generic LLM into an expert on your domain. In practice it is more nuanced:

- A fine-tuned 4B–8B model can outperform a generic 70B+ model on a narrowly defined task, but only if the task is genuinely narrow, the data is high quality, and evaluation confirms it.
- Supervised fine-tuning on domain data does not improve reasoning: if the base model cannot solve a certain class of logical tasks, more domain examples will not change that. For reasoning, RL-based methods such as GRPO with verifiable rewards are appropriate. More in the article SFT, DPO, GRPO — which method and when.
- Fine-tuning is not a one-off project: data changes, models age, regressions accumulate. Without infrastructure for repeatable train-eval-deploy cycles, the project gradually falls apart.
- Hallucinations remain: fine-tuning can reduce them within a specific domain, but does not eliminate them. Guardrails, RAG grounding, and human-in-the-loop are still needed wherever correctness matters.

Teams that understand these limits before starting a project end up with usable models. Teams that learn about the limits after six months of development usually cancel the project.

---

Summary: checklist before launching a project

Before deciding to launch a fine-tuning project, we recommend working through these seven questions:

1. Do you have a sufficient dataset? Example count, topic coverage, verified quality.
2. Is fine-tuning the right tool? Or would RAG or prompt engineering be enough?
3. Do you have eval set up? Hold-out set, baseline, task-specific metrics.
4. Do you have GPU infrastructure? Or a realistic cloud training plan with a cost estimate.
5. Are you prepared to iterate? A single fine-tuning run is not enough; the pipeline must be repeatable.
6. Do you have a domain expert in the loop? Someone who can verify the model answers correctly, not just fluently.
7. Do you have a plan for forgetting and regressions? Monitoring, rollback, eval after every new training run.

If any of these questions lacks a clear answer, the project is not ready to launch; it is still in the preparation phase.

---

Frequently asked questions

Why does my fine-tuned model perform worse than the base model?

The most common cause is overfitting on a small or low-quality dataset. The model learns "patterns" from the training data, but generalisation to new inputs fails. Solution: improve data quality, use early stopping, reduce the LoRA adapter rank, or reassess the entire project from a RAG vs fine-tuning perspective.

How many examples do I actually need for fine-tuning?

For SFT, functional results are possible from ~1,000 high-quality examples, but production systems typically work with 10,000–100,000. More important than quantity is coverage — if key categories lack sufficient representation, the model will be unreliable in those areas regardless of the total example count.

Is fine-tuning suitable for updating model knowledge (new products, prices, records)?

No. Fine-tuning does not reliably store facts: the model may reproduce information from training but will mix it with hallucinations. For knowledge that changes or must be verifiable, the correct tool is RAG with an up-to-date document database.

Can a small fine-tuned model outperform a large generic model?

Yes — under specific conditions. A fine-tuned 4B–8B model can outperform a generic 70B+ model on a narrowly defined task, provided the task is well-bounded, the dataset is high quality, and evaluation confirms it. On broad, general tasks the larger model typically wins.

What is catastrophic forgetting and how do you prevent it?

Catastrophic forgetting is a phenomenon in which fine-tuning on narrow data degrades the model's general capabilities — languages, logic, reasoning. The most effective countermeasure is LoRA or QLoRA, which modify only a small number of parameters and preserve the original weights. Merging the fine-tuned model with the base model using a tool such as mergekit provides additional benefit.

---

*MP Industrial Solutions helps companies determine when fine-tuning genuinely makes sense and when a simpler, less costly path exists. If you are considering domain adaptation of an LLM, we are happy to assess your use-case — from dataset to infrastructure and evaluation.*
