Which open-weight model do you recommend as a base for domain SFT?

For most domain projects in 2026, good foundations are models from the Qwen, Llama, or Mistral families in the 7B–14B range. The choice depends on context length, licence, and which base model is compatible with your training framework. For specific recommendations with numbers, see [How to Choose an LLM Model](/en/blog/vyber-llm-modelu-2026).

*If you are considering fine-tuning your own model and are not sure where to start (SFT, DPO, or another method), we are happy to walk through your specific use case and propose a realistic plan. Contact us at [mp-is.eu](https://mp-is.eu) or book a consultation directly.*

SFT, DPO, GRPO: Three Ways to Fine-Tune a Model — and When to Use Each

When a company starts fine-tuning its own language model, the first question is usually "How much data do we need?" The better question is: "What do we want to change in the model — and which training objective matches that goal?" SFT, DPO, and GRPO are three different answers to three different problems. Choosing the right method before you start collecting data determines whether the project works six months from now or not.

This article does not explain how to install a training framework. It explains what each method does, when to deploy it, how much data you actually need, and why the order SFT → DPO → GRPO is not a coincidence.

The basics: what LoRA is and what a training objective is — two distinct concepts

Before comparing the methods themselves, it is important to distinguish two things that are often conflated in discussions.

LoRA (Low-Rank Adaptation) and QLoRA (the same technique applied on top of a quantised, typically 4-bit, base model) are *mechanisms*: ways to adapt a model without updating all of its weights. Instead, you train small low-rank adapter matrices whose product is added to the frozen model weights. This lets you fit into much less GPU memory: a typical 7B model can be fine-tuned with QLoRA on a GPU with ~9 GB VRAM, whereas full fine-tuning would require ~70–120 GB. A more detailed comparison of these mechanisms can be found in the article LoRA vs QLoRA vs Full Fine-Tuning.
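To make the memory argument concrete, here is a minimal sketch of the parameter arithmetic behind LoRA. It assumes the standard formulation in which a frozen weight matrix W (d_out × d_in) is augmented by the product of two trainable matrices B (d_out × r) and A (r × d_in). The layer size and rank below are hypothetical values chosen only for illustration.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two adapter matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the frozen weight matrix W (d_out x d_in)."""
    return d_in * d_out


# Hypothetical layer size and rank, not taken from any specific model.
d_in, d_out, rank = 4096, 4096, 16
adapter = lora_trainable_params(d_in, d_out, rank)
frozen = full_params(d_in, d_out)

print(f"adapter: {adapter:,} params, frozen: {frozen:,} params")
print(f"trainable share of this layer: {adapter / frozen:.2%}")
```

Only the adapter parameters need gradients and optimizer state, which is where most of the memory saving comes from. QLoRA additionally stores the frozen weights in quantised form.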

SFT, DPO, and GRPO, by contrast, are *training objectives*: they define what the model learns. Full fine-tuning, like LoRA, is a mechanism. You can do SFT with LoRA, SFT with full fine-tuning, or DPO with QLoRA; objectives and mechanisms can be combined freely. In practice, most domain projects today use LoRA or QLoRA for purely economic reasons, but the training objective remains the more fundamental decision.

SFT — supervised fine-tuning: teaching the model format and behaviour

SFT (supervised fine-tuning) is the foundational method. You give the model an input and the desired output, and the model learns to imitate these pairs. It is essentially an extension of pre-training: the model learns from samples in the format (prompt → answer).

SFT addresses the question: *"How should the model respond to this type of task?"*
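A minimal sketch of what a single SFT training example can look like at the token level. It assumes a common convention in which the loss is computed only on the answer tokens and prompt positions are masked with the label -100, the default `ignore_index` of PyTorch's cross-entropy loss. Whether to mask the prompt is a design choice, and the token ids and the function name `build_sft_example` are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_sft_example(prompt_ids: list[int], answer_ids: list[int]) -> dict[str, list[int]]:
    """Concatenate prompt and answer; mask prompt positions so only the answer is learned."""
    input_ids = prompt_ids + answer_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return {"input_ids": input_ids, "labels": labels}


# Hypothetical token ids standing in for a tokenised (prompt, answer) pair.
example = build_sft_example(prompt_ids=[11, 12, 13], answer_ids=[21, 22])
print(example["input_ids"])
print(example["labels"])
```

With this layout the model still reads the prompt as context, but it is only penalised for mistakes in the answer.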

When to use SFT

SFT is the right choice when:

The model has the right knowledge but responds in the wrong format (too long, too brief, wrong language, wrong structure)
You want to teach the model domain jargon, terminology, or a communication style
You have a clearly defined task with consistent patterns — for example, document classification, entity extraction, or report generation from a template
You want to distil the behaviour of a stronger model into a smaller one (a teacher model generates the examples; a student model trains on them)

SFT also serves as the foundation for all subsequent methods. A base model that has not gone through SFT cannot be reliably refined further with DPO or GRPO; the section on ordering below explains why.

How much data SFT needs

Research has shown that 1,000 high-quality examples can produce significantly better output than 100,000 low-quality samples. For production systems, however, a typical dataset size is more like 10,000–100,000 pairs, because you want to cover the long tail of input variants that appear in real-world operation.

The practical minimum for trustworthy results in a domain project is around 5,000 high-quality examples covering most of the topic areas the model will encounter. Below this threshold, the model may improve on the data it has seen but fails on variants it has not.

For regulated industries (law, medicine, pharmaceuticals), a stricter rule applies: the dataset must cover every jurisdiction and every document type the model will work with. Partial coverage produces a model that answers with high confidence even in areas where it lacks sufficient training examples — which is worse than not addressing the problem at all.

What SFT does not solve

SFT does not teach the model *judgement* — it does not know that one answer is better than another, only that such an answer exists. If the model tends to respond in an unhelpful way, to be too brief where that causes harm, or to avoid certain types of questions, SFT alone will not fix that. That is what DPO is for.

DPO — direct preference optimization: teaching the model what is better

DPO (direct preference optimization) training works through preference pairs — for each prompt you have two responses: a winner (preferred) and a loser (less preferred). The model learns to shift its response distribution toward the winner and away from the loser.

DPO is a simpler alternative to classic RLHF (reinforcement learning from human feedback): it optimises the model directly on preference pairs and needs no separate reward model or RL sampling loop, which makes it considerably cheaper and more stable to train.

DPO addresses the question: *"How should the model decide when multiple possible answers exist?"*
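The shift toward the winner and away from the loser can be written as a per-pair loss. The sketch below follows the commonly published DPO formulation: the loss is the negative log-sigmoid of `beta` times the difference between the policy's and a frozen reference model's log-probability margins. The log-probabilities and the value of `beta` are hypothetical inputs; in a real run they come from summing token log-probabilities of each response.

```python
import math


def dpo_loss(policy_logp_w: float, policy_logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    # How much more the policy favours the winner over the loser than the reference does.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    x = beta * margin
    # Numerically stable form of -log(sigmoid(x)).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# Policy identical to the reference: no preference learned yet.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved toward the winner and away from the loser.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
print(start, improved)
```

The loss falls as the policy raises the winner's likelihood relative to the loser's, measured against the reference model. That reference is normally the SFT checkpoint, which is one reason a stable SFT baseline matters.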

When to use DPO

DPO is the right choice when:

You have defined preferences: what constitutes a better answer versus a worse one (human-verified or validated through an automated process)
You want to reduce the model's tendency to respond in a particular undesirable way — too passively, too verbosely, with too many hedging phrases
You want to fine-tune the communication tone without completely rewriting the training data
You already have an SFT baseline and want to further improve model alignment

How much data DPO needs

The minimum recommended quantity is ~2,000 preference pairs with human-verified winner/loser judgements. This is not an arbitrary number — below this threshold the preference signal cannot be reliably separated from noise in the distribution, and the model may overfit to the artefacts of specific annotators.

For good generalisation you want broader coverage: 5,000–10,000 pairs spanning different prompt types and scenarios is a common production target.

More important than the number is the quality of the judgement. If winner/loser pairs are rated inconsistently or against unclear criteria, the model learns an inconsistent policy. Before collecting data, it is essential to have a clear rubric: what specifically makes one answer better.
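One way to check whether a rubric is applied consistently is to have two annotators label the same sample of pairs and measure their agreement. The sketch below uses Cohen's kappa, a standard chance-corrected agreement statistic. It is offered as one possible check, not as the method used in the projects described here, and the labels are made-up illustration data.

```python
from collections import Counter


def cohen_kappa(labels_1: list[str], labels_2: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Agreement expected if both annotators labelled at random with their own label frequencies.
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1.keys() | counts_2.keys()) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: which of two responses ("A" or "B") each annotator preferred.
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(kappa)
```

A low value on a pilot batch suggests the rubric needs tightening before the full set of preference pairs is collected.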

Order: SFT before DPO is mandatory

DPO is applied to a model that has already gone through SFT — not to a base model directly. The reason is practical: a base model produces too much variance in its outputs, and the DPO signal "disperses" — the model has no stable behavioural baseline against which the preference gradient can take effect.

In practice, the sequence looks like this:

1. Base model → SFT → instruction-tuned model (can respond consistently to given task types)
2. Instruction-tuned model → DPO → aligned model (prefers better answers over worse ones)

Skipping SFT and applying DPO directly from a base model typically produces unstable results or a model that does not follow instructions.

GRPO — group relative policy optimization: teaching the model to reason

GRPO (group relative policy optimization) belongs to the family of reinforcement learning methods that train on rewards. Instead of preference pairs, the model is given a verifiable task (a maths problem, a logic problem, a coding task) and receives a reward based on whether its output is objectively correct. For each prompt the model samples a group of answers, and each answer is scored relative to the rest of its group, hence the name.

GRPO gained prominence after the release of DeepSeek R1, where it was used for reasoning-oriented fine-tuning. Its key advantage over the older PPO (proximal policy optimization) is that GRPO requires no separate critic model, which reduces VRAM requirements and simplifies the training pipeline.

GRPO addresses the question: *"How do you teach a model to reason better on tasks where the correct answer is verifiable?"*
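The "group relative" part of the name can be illustrated with a few lines of arithmetic. In the commonly described formulation, several completions are sampled for the same prompt, and each completion's advantage is its reward minus the group mean, divided by the group standard deviation. The sketch assumes the population standard deviation and a small `eps` for stability. Implementations differ on both details, and the rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalise rewards within one group of completions sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions: two correct (1.0), two wrong (0.0).
mixed_group = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# All completions equally rewarded: nothing to prefer within the group.
uniform_group = group_relative_advantages([1.0, 1.0, 1.0])
print(mixed_group)
print(uniform_group)
```

Because the baseline is the group's own mean, no separate critic model is needed. The second case also shows a practical limit: if every completion in a group gets the same reward, the advantages are zero and that prompt contributes no learning signal.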

When to use GRPO

GRPO is the right choice when:

You have tasks with verifiable answers — mathematics, code, logic, SQL queries, structured data extraction with a gold annotation
You want to improve reasoning — the model's ability to work through multiple steps without losing context
You want the model to generate chain-of-thought on tasks where that is valuable
You have an environment where output correctness can be evaluated automatically without a human annotator

GRPO is typically the third step in the pipeline, not the first. The model must have a solid SFT foundation and ideally DPO alignment before RL training can be applied effectively.

How much data GRPO needs

The minimum is ~1,000 scored trajectories — prompts with a verifiable answer and a functional reward signal. The emphasis is on "verifiable": the reward must be consistent and automatically computable. If the reward depends on subjective judgement, RL training produces unstable results.

In practice, GRPO is applied to smaller, targeted datasets (thousands of prompts, not hundreds of thousands), because each prompt is sampled many times during training and therefore yields more signal than a single supervised pair. On the other hand, building a verifiable reward is expensive: you need to define a metric, write an evaluator, and make sure the evaluator itself does not make errors.
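A minimal sketch of what an automatic evaluator can look like for numeric answers. It assumes a hypothetical output convention in which the completion ends with a line of the form `Answer: <number>`. The pattern, the function name, and the examples are illustrative only.

```python
import re

# Assumed convention: the completion must end with "Answer: <number>".
ANSWER_PATTERN = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")


def exact_match_reward(completion: str, gold: str) -> float:
    """Return 1.0 if the final numeric answer equals the gold value, else 0.0."""
    match = ANSWER_PATTERN.search(completion.strip())
    if match is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if float(match.group(1)) == float(gold) else 0.0


print(exact_match_reward("2 + 2 equals four.\nAnswer: 4", "4"))
print(exact_match_reward("Probably five.\nAnswer: 5", "4"))
print(exact_match_reward("I think it is 4", "4"))
```

Even a small evaluator like this deserves its own test cases (different number formats, missing answers, trailing text), because any systematic error in the reward is exactly what the model will learn to exploit.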

GRPO is still experimental

GRPO is an active research area. Multiple variants exist (DAPO and others), and the community is still working out where its limits lie. For most domain projects in a B2B setting, GRPO is only relevant if:

You are working on reasoning-heavy tasks (complex analysis, multi-line code, technical diagnostics)
You have the capacity to write and validate a reward function
The team has experience with RL training — debugging RL is significantly more complex than debugging SFT

For most domain adaptations (style, terminology, format), SFT + DPO is sufficient and far more stable.

Three methods side by side — a quick comparison

SFT (supervised fine-tuning):
- Input: (prompt, desired answer) pairs
- Teaches: format, style, terminology, behaviour on specific tasks
- Data minimum: ~5,000 high-quality pairs for a domain production system
- Order: always the first step

DPO (direct preference optimization):
- Input: (prompt, winner answer, loser answer) triples
- Teaches: preferred responses, alignment, improved tone and decision-making
- Data minimum: ~2,000 human-verified preference pairs
- Order: after SFT, not from a base model

GRPO (group relative policy optimization):
- Input: prompt + automatic reward signal (verifiable correctness)
- Teaches: reasoning, chain-of-thought, accuracy on verifiable tasks
- Data minimum: ~1,000 scored trajectories with a functional evaluator
- Order: after SFT (and ideally DPO) as the third step

Catastrophic forgetting — the hidden cost of every fine-tuning run

Every method carries the risk of catastrophic forgetting: a model that is intensively trained on a narrow domain can degrade on capabilities it did not see in the training data. In practice this means: a model that excels at generating technical reports may start struggling with conversational questions or logical reasoning outside the domain.

PEFT mechanisms like LoRA mitigate this effect because they modify only a small portion of the weights — but they do not eliminate it. Practical mitigation steps:

1. Mix domain data with general-purpose samples in the training dataset (5–15% general mix)
2. After every training run, evaluate the model outside the domain, not just on domain benchmarks
3. Keep the pre-fine-tuning version of the model as a fallback
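The first mitigation step can be sketched as a small helper. It assumes that the "general mix" percentage means the share of general-purpose samples in the final combined dataset. That interpretation is an assumption, and the function names and toy data are hypothetical.

```python
import random


def general_samples_needed(n_domain: int, general_share: float) -> int:
    """Number of general samples so they make up `general_share` of the combined dataset."""
    if not 0.0 <= general_share < 1.0:
        raise ValueError("general_share must be in [0, 1)")
    return round(n_domain * general_share / (1.0 - general_share))


def mix_datasets(domain: list, general: list, general_share: float, seed: int = 0) -> list:
    """Combine all domain samples with a random subset of general samples, then shuffle."""
    n_general = general_samples_needed(len(domain), general_share)
    if n_general > len(general):
        raise ValueError("not enough general-purpose samples for the requested share")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for real training records.
domain_data = [("domain", i) for i in range(90)]
general_data = [("general", i) for i in range(50)]
mixed = mix_datasets(domain_data, general_data, general_share=0.10)
print(len(mixed), sum(1 for source, _ in mixed if source == "general"))
```

Fixing the seed makes the mix reproducible between runs, which helps when comparing out-of-domain evaluation results across training iterations.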

A more detailed look at how to measure whether fine-tuning helped or hurt can be found in the article How to Measure Whether Fine-Tuning Helped.

The pipeline in practice: from base model to production deployment

In B2B projects where we have deployed a domain model, the typical process looks like this:

Phase 1 — choosing the base. You select a suitable open-weight model (Qwen, Llama, Mistral family) based on VRAM capacity and the required context length. For most domain tasks, a 7B–14B model offers the optimal performance-to-cost ratio. If you have a GPU with 24 GB VRAM (e.g. RTX 3090/4090), QLoRA on a 7B model runs comfortably; a 13B model fits, but tightly. More on model selection and GPU sizing can be found in the article Which GPU for LLM Inference.

Phase 2 — collecting SFT data. You identify the task types, the format of the required responses, and the terminology. You collect or generate 5,000–50,000 pairs. For domain projects, a good recipe is: 150–200 high-quality human-seed examples, expanded 10–100× using a strong frontier model (Claude, GPT as teacher). The result is verified by manually annotating a sample.
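A quick piece of arithmetic helps when planning the expansion step. The sketch below computes the dataset sizes implied by the seed counts and multipliers mentioned above, and the multiplier needed to reach a given target. The function names are hypothetical, and the calculation ignores any examples discarded during verification.

```python
def expanded_size(seed_count: int, multiplier: int) -> int:
    """Dataset size after expanding each seed example `multiplier` times."""
    return seed_count * multiplier


def required_multiplier(target_size: int, seed_count: int) -> int:
    """Smallest whole multiplier that reaches `target_size` from `seed_count` seeds."""
    return -(-target_size // seed_count)  # ceiling division


low = expanded_size(150, 10)     # fewest seeds, smallest expansion
high = expanded_size(200, 100)   # most seeds, largest expansion
print(low, high)
print(required_multiplier(5_000, 200), required_multiplier(50_000, 200))
```

Under these figures, 150–200 seeds expanded 10–100× give roughly 1,500 to 20,000 pairs, so the upper part of the 5,000–50,000 range would need more seeds or a larger multiplier.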

Phase 3 — SFT run. Training with LoRA or QLoRA, typically for several epochs. On an A100 GPU this takes hours, not days. Rough cloud costs run in the tens of euros per run for a 7B model on 10K examples — depending on the provider and GPU used.

Phase 4 — evaluation and decision. A test set covering all task types. If the results are satisfactory, the model goes to production. If not — you analyse where it fails, rather than blindly adding more data.
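Analysing where the model fails usually starts with breaking results down by task type rather than looking at one aggregate score. A minimal sketch, assuming each test result is recorded as a (task type, correct or not) pair; the task names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_task(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of correct answers per task type."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for task_type, is_correct in results:
        totals[task_type] += 1
        correct[task_type] += int(is_correct)
    return {task: correct[task] / totals[task] for task in totals}


# Hypothetical test-set outcomes.
results = [
    ("classification", True),
    ("classification", True),
    ("extraction", True),
    ("extraction", False),
]
per_task = accuracy_by_task(results)
print(per_task)
```

A breakdown like this shows whether a weak overall score comes from one task type with thin data coverage or from a problem spread across all of them.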

Phase 5 (optional) — DPO. If you have the capacity to collect preference pairs and the model has specific behaviour you want to change (not just missing knowledge), DPO is the correct next step.

Phase 6 (specialised) — GRPO. Only if you are working on a reasoning-heavy use case and have a verifiable reward signal.

When not to fine-tune

Fine-tuning is not the answer to every problem. We have seen projects where a company invested weeks in SFT and the result was worse than a simple RAG pipeline with a good prompt. Before fine-tuning, ask yourself:

Is the problem that the model does not know something (facts, documents)? If so, RAG is more effective and cheaper. Fine-tuning does not reliably teach a model new facts; it only changes behaviour.
Is the problem that the model responds poorly in terms of format or style? If so, a better system prompt may be sufficient before investing in a dataset.
Do you have enough high-quality data to cover the domain? If not, fine-tuning produces a model that answers confidently even where it has no grounding.
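The three questions above can be read as a simple decision order. The sketch below encodes that order as a function. It simplifies on purpose: the flag names are hypothetical, and real projects often combine approaches (for example RAG plus a lightly fine-tuned model).

```python
def recommend_approach(missing_knowledge: bool,
                       format_or_style_issue: bool,
                       prompt_already_tuned: bool,
                       enough_quality_data: bool) -> str:
    """Walk through the pre-fine-tuning questions in order and return a first recommendation."""
    if missing_knowledge:
        return "RAG"
    if format_or_style_issue and not prompt_already_tuned:
        return "improve the system prompt first"
    if not enough_quality_data:
        return "collect more data before fine-tuning"
    return "fine-tuning (start with SFT)"


print(recommend_approach(True, False, False, True))
print(recommend_approach(False, True, False, True))
print(recommend_approach(False, True, True, True))
```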

The decision framework for RAG vs fine-tuning is explored in more detail in the dedicated article RAG vs Fine-Tuning — When to Use Which.

Frequently asked questions

Can I apply DPO directly from a base model without SFT?

Technically yes, but the result is usually unstable. A base model produces outputs with too much variance, so the DPO gradient cannot take effect efficiently: the model has no consistent behavioural baseline. In practice you almost always need at least a minimal SFT pass before DPO.

Is GRPO suitable for enterprise projects outside of technology?

GRPO is strong where you have verifiable answers — mathematics, code, structured extractions with gold annotations. For most B2B use cases (customer support, documentation assistant, reporting), SFT + DPO is sufficient and far simpler to implement and debug. We recommend GRPO only if the team has experience with RL training.

How much does cloud fine-tuning cost for a 7B model?

A rough estimate: an SFT run on 10,000 examples takes on the order of hours on an A100 GPU, with costs in the tens of euros (on cheaper providers) to the low hundreds of euros (hyperscalers). The real project cost depends on the number of iterations, dataset size, and how many times training is repeated after data revisions. The larger cost is typically data collection and annotation, not the training run itself.

What is catastrophic forgetting and how do you prevent it?

Catastrophic forgetting occurs when fine-tuning on a narrow domain degrades the model's general capabilities — for example, logical reasoning or conversational ability outside the domain. You mitigate it by mixing domain data with general-purpose samples (5–15% general mix in the dataset), using LoRA/QLoRA (less aggressive weight modification), and evaluating outside the domain after every training run.
