A company deploys RAG over its internal documentation. Results are acceptable, but the model still makes strange errors: it doesn't understand abbreviations, it conflates domain-specific terms, and its phrasing feels generic. The team tries SFT (supervised fine-tuning) on a few hundred examples. There is some improvement, but deep domain knowledge is still missing. Someone suggests: "What if we trained the model on all the company documentation? All 800 PDFs?"
That is precisely the situation where continued pretraining comes into play — a method that sits between training a model from scratch and classical domain fine-tuning. It is more powerful, more expensive, and less well understood than SFT. And that is exactly why it is worth understanding before you deploy it or rule it out.
What continued pretraining is — and what it is not
Continued pretraining (sometimes also called domain-adaptive pretraining, DAPT, or second-stage pretraining) is the process of taking a finished pretrained model and continuing to train it in the style of pretraining — that is, on a large corpus of unlabelled text, using causal language modelling (next-token prediction).
The difference from classical fine-tuning (SFT) is fundamental:
- SFT trains the model on labelled input–output pairs (question–answer, instruction–result). It teaches the model *how to behave* and *how to respond*. It requires relatively small but carefully labelled datasets.
- Continued pretraining trains the model on raw text without structured responses. It teaches the model *what it knows* — the language distribution, concepts, patterns, and factual knowledge of a specific domain. It requires large amounts of text, but without manual labelling.
Another common confusion: continued pretraining is not the same as full fine-tuning. Full fine-tuning describes *how many weights are trained* (all of them, as opposed to LoRA). Continued pretraining describes *what type of training* takes place (further pretraining vs. instruction fine-tuning). You can do continued pretraining via LoRA, QLoRA, or full parameter update — these are orthogonal dimensions.
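The SFT versus continued pretraining distinction shows up concretely in which tokens contribute to the loss. The following is a minimal sketch, assuming the common convention of masking prompt tokens in SFT with the label value -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The token ids and function names are hypothetical, and some SFT setups do not mask the prompt at all.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def clm_labels(token_ids):
    """Continued pretraining: every token of the raw text is a prediction target."""
    return list(token_ids)


def sft_labels(prompt_ids, response_ids):
    """SFT (assumed convention): prompt tokens are masked, response tokens are targets."""
    return [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)


def supervised_positions(labels):
    """Count label positions that are not masked out of the loss."""
    return sum(1 for label in labels if label != IGNORE_INDEX)


doc = [101, 7, 42, 42, 9, 102]                # hypothetical token ids of raw text
prompt, response = [101, 7, 42], [42, 9, 102]  # the same ids split into a Q&A pair

print(supervised_positions(clm_labels(doc)))               # 6
print(supervised_positions(sft_labels(prompt, response)))  # 3
```

The same sequence yields twice as many training targets under causal language modelling here, which is one reason raw text goes further per token than labelled pairs.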
When continued pretraining makes sense
Not every domain problem calls for continued pretraining. From practice, we see it making sense in three main scenarios:
1. A new domain with high technical density and its own language
When your domain uses terminology, abbreviations, and phrases that are rare or absent in standard pretraining text. Examples: industrial documentation with tens of thousands of abbreviations, specific medical sub-specialties, corporate regulatory language, technical standards. After SFT the model may know how to *format* responses, but it does not deeply understand the concepts — and that shows when concrete questions are asked.
2. A lack of labelled data, but an abundance of raw text
Labelling data for SFT is expensive and slow. If you have 500 GB of technical documentation but only 500 quality-labelled Q&A examples, continued pretraining lets you use the entire text corpus without needing to label it. SFT then follows as a "behaviour calibration" step on top of that new knowledge base.
3. A different language or language mixing
Most popular open-weight models are trained predominantly on English text. If you need strong capabilities in Slovak, German, Czech, or other languages with lower representation in the original pretraining, continued pretraining on a large corpus in that language can significantly improve the model's command of it, including grammar, idioms, and cultural context.
Conversely, continued pretraining *does not make sense* when:
- You have enough quality Q&A examples and the domain is not radically different from the original pretraining data
- Your primary need is a change in *behaviour* (format, tone, refusals, response structure); that is work for SFT or DPO
- You have a limited compute budget and need a fast result
A typical pipeline: what it looks like in practice
Once you decide to pursue continued pretraining, you are looking at a multi-phase process. Cutting corners leads to trouble.
Phase 1 — Corpus preparation
This is the longest and most important phase. The source text must be cleaned: no duplicates, no OCR artefacts, no irrelevant content (footers, navigation, form fields). N-gram-level deduplication is also recommended — the model should not see the same sentence a hundred times, because that leads to memorisation rather than generalisation.
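N-gram-level deduplication can be sketched as follows. This is an illustrative version, assuming word 5-grams ("shingles"), Jaccard similarity, and a threshold of 0.8; all three are assumptions, not values from a specific pipeline. The pairwise comparison is quadratic in the number of documents, so larger corpora typically rely on an approximate method such as MinHash.

```python
import re


def shingles(text, n=5):
    """Set of word n-grams in a text (lower-cased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, n=5, threshold=0.8):
    """Keep a document only if it is not a near-duplicate of one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        current = shingles(doc, n)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


docs = [
    "Close valve V-12 before starting the pump and check the pressure gauge.",
    "Close valve V-12 before starting the pump and check the pressure gauge!",
    "Replace the filter cartridge every 500 operating hours.",
]
print(len(deduplicate(docs)))  # 2
```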
Corpus size: for a meaningful effect, in practice we are talking about *hundreds of millions of tokens*, ideally *billions*. A standard technical PDF document contains roughly 50–200 thousand tokens once processed. Five hundred such documents therefore amount to something on the order of 25–100 million tokens, which is at the lower boundary of meaningful continued pretraining.
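The corpus-size arithmetic can be reproduced with a few lines. The per-document range of 50–200 thousand tokens comes from the text; the function name is hypothetical, and real counts depend on the tokenizer and on how much survives cleaning.

```python
def corpus_token_range(num_docs, tokens_per_doc_low=50_000, tokens_per_doc_high=200_000):
    """Rough lower and upper bound on corpus size in tokens."""
    return num_docs * tokens_per_doc_low, num_docs * tokens_per_doc_high


low, high = corpus_token_range(500)
print(f"{low / 1e6:.0f}-{high / 1e6:.0f} million tokens")  # 25-100 million tokens

low, high = corpus_token_range(800)
print(f"{low / 1e6:.0f}-{high / 1e6:.0f} million tokens")  # 40-160 million tokens
```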
An important detail: mixed pretraining is better than purely domain-specific training. If you train only on domain texts, the model will forget its general-purpose capabilities and its language will "stiffen" to a single register. A good recipe is a mix of 80–90% domain data and 10–20% general text (for example, a quality web corpus). This mitigates catastrophic forgetting.
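To size the general-text portion of the mix, it helps to work backwards from the domain corpus. A small sketch, assuming the shares are measured in tokens (the text does not say whether the 80–90% refers to tokens or documents); the function name is hypothetical.

```python
def general_tokens_needed(domain_tokens, domain_share):
    """Tokens of general text to add so domain data makes up `domain_share` of the mix."""
    if not 0 < domain_share <= 1:
        raise ValueError("domain_share must be in (0, 1]")
    return round(domain_tokens * (1 - domain_share) / domain_share)


domain_tokens = 200_000_000
for share in (0.8, 0.9):
    extra = general_tokens_needed(domain_tokens, share)
    print(f"{share:.0%} domain share: add about {extra / 1e6:.0f}M general tokens")
```

For a 200M-token domain corpus this comes to roughly 50M general tokens at an 80% domain share and roughly 22M at 90%.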
Phase 2 — Training configuration
The key difference from pretraining from scratch: the learning rate must be significantly lower. Peak values on the order of 1e-5 to 1e-4 are typical, which is usually several times to an order of magnitude or more below the peak rate of the original pretraining. If the learning rate is too high, the model "forgets" what it knew; if it is too low, there is no domain adaptation.
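A low peak learning rate is usually combined with a schedule. The sketch below assumes linear warmup followed by cosine decay, which is a common choice but not something the text prescribes; the peak of 2e-5, the 100 warmup steps, and the 10% floor are illustrative assumptions.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=100, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay down to peak_lr * min_lr_ratio."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(1.0, progress)))
    min_lr = peak_lr * min_lr_ratio
    return min_lr + (peak_lr - min_lr) * cosine


total = 1000
for step in (0, 99, 550, 1000):
    print(step, f"{lr_at_step(step, total):.2e}")
```

The warmup avoids a large first update on weights that are already well trained, and the decay floor keeps late steps from contributing nothing.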
On training architecture: in practice most teams reach for LoRA or QLoRA even for continued pretraining, because a full parameter update is extremely costly. LoRA works for continued pretraining — not quite as well as a full-weight update, but sufficient in practice for most domain adaptations.
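The cost gap between LoRA and a full update follows from how few parameters LoRA trains. A rough count, assuming a 7B-class shape (hidden size 4096, 32 layers), rank 16, and adapters on two square projection matrices per layer; all of these are illustrative assumptions rather than a recommended configuration.

```python
def lora_trainable_params(d_in, d_out, rank):
    """LoRA adds two low-rank matrices per adapted weight: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


hidden, layers, rank = 4096, 32, 16  # assumed 7B-class model shape and LoRA rank
adapted_per_layer = 2                # assumption: two attention projections per layer

total = layers * adapted_per_layer * lora_trainable_params(hidden, hidden, rank)
print(total)                 # 8388608
print(f"{total / 7e9:.2%}")  # 0.12%
```

Adapting more matrices per layer or raising the rank increases the count proportionally, which is one lever for closing part of the gap to a full-weight update.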
Phase 3 — SFT as the finishing step
Continued pretraining on its own does not produce a "conversational" model. It produces a model that deeply understands the domain but can only generate free-form text. That is why SFT (and sometimes DPO) almost always follows continued pretraining, so the model learns to respond correctly to instructions, answer questions, and follow a format. More on this pipeline can be found in the article on choosing between SFT, DPO, and GRPO.
Costs and hardware requirements
Here it is important to be honest: continued pretraining is *more expensive* than SFT — noticeably so.
With SFT on 10,000 examples, you are looking at hours of training on a single A100, on the order of tens of dollars. Continued pretraining on 500 million tokens can take days even on multiple GPUs, and that is with LoRA. A full parameter update on a billion-token corpus may require hundreds of A100-hours, which in the cloud translates to costs in the hundreds to thousands of euros.
For rough practical reference:
- LoRA continued pretraining of a 7B model on ~200M tokens: roughly 10–30 hours on a single A100 80 GB
- QLoRA 4-bit continued pretraining of a 7B model on the same corpus: longer (dequantisation slows things down), but fits on a consumer GPU with 24 GB VRAM
- Full-parameter continued pretraining of a 13B model: multi-GPU is almost unavoidable
A100 cloud pricing today runs from roughly $0.60/hr (spot prices at smaller providers) to ~$3–4/hr at large hyperscalers. Budget a buffer — the first run usually exposes data problems and the training has to be restarted.
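For a first budget estimate of a full-parameter run, a back-of-the-envelope calculation is often enough. The sketch below assumes the common approximation of about 6 FLOPs per parameter per token for dense transformer training, the A100's 312 TFLOPS dense BF16 peak, and 40% hardware utilisation; the utilisation figure in particular is an assumption and varies widely between setups. The price range is the one quoted above.

```python
A100_PEAK_FLOPS = 312e12  # dense BF16 peak of an A100


def full_update_gpu_hours(params, tokens, mfu=0.4, peak_flops=A100_PEAK_FLOPS):
    """Estimated A100-hours for full-parameter training (6 * params * tokens FLOPs)."""
    total_flops = 6 * params * tokens
    return total_flops / (peak_flops * mfu) / 3600


def cost_range(gpu_hours, price_low=0.60, price_high=4.00):
    """Cost bounds in dollars for the hourly price range quoted in the text."""
    return gpu_hours * price_low, gpu_hours * price_high


hours = full_update_gpu_hours(params=7e9, tokens=1e9)
low, high = cost_range(hours)
print(f"about {hours:.0f} A100-hours, roughly ${low:.0f} to ${high:.0f} for one clean run")
```

The estimate scales linearly with model size and token count, and it covers a single uninterrupted run only; restarts, evaluation, and the general-text share of the mix come on top.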
Risks we see in practice
Catastrophic forgetting is a real threat. After overly aggressive continued pretraining, a model can degrade on general capabilities — worse English, worse instruction-following, worse reasoning. Mitigations: low learning rate, mixed training (see above), and optionally regularisation. Evaluate not just domain performance, but general benchmarks before and after training.
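The before-and-after comparison can be automated with a simple regression check. A minimal sketch, assuming each benchmark yields a single accuracy-like score where higher is better; the benchmark names, scores, and the 0.02 tolerance are hypothetical.

```python
def find_regressions(before, after, tolerance=0.02):
    """Return benchmarks whose score dropped by more than `tolerance` (absolute)."""
    return {
        name: round(after[name] - before[name], 4)
        for name in before
        if name in after and before[name] - after[name] > tolerance
    }


# Hypothetical scores measured before and after continued pretraining
before = {"general_qa": 0.62, "reasoning": 0.48, "domain_qa": 0.31}
after = {"general_qa": 0.55, "reasoning": 0.47, "domain_qa": 0.58}

print(find_regressions(before, after))  # {'general_qa': -0.07}
```

In this made-up example the domain score improves sharply while one general benchmark drops beyond tolerance, which is the pattern that should trigger a lower learning rate or a larger general-text share.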
"The model is just memorising" — if the corpus is small and contains many repeated documents, the model learns to quote text instead of understanding it. Deduplication and data diversity are mandatory.
Naïve data = naïve model: by 2026 a substantial share of new web text is AI-generated, so a web crawl brings the risk of training on synthetic text that mirrors another model's output distribution rather than human writing. For industrial corporate documents this is typically not a problem, but be careful when collecting external sources.
Regulatory and data risks — continued pretraining on corporate documentation can inadvertently "unlearn" guardrails from the original instruction-tuned model. After continued pretraining, SFT and alignment fine-tuning (DPO or GRPO) must follow to restore these mechanisms. If you skip that step, you have a model without built-in safety behaviours. For regulated industries, this is critical.
We write in more detail about other reasons why domain adaptation fails in the article on the most common causes of failed fine-tuning.
The alternative: when RAG replaces continued pretraining
For many companies, continued pretraining is the wrong answer to the right question. When the goal is for the model to "know the content of documents", RAG (Retrieval-Augmented Generation) solves that more cheaply, more quickly, and with better updateability. The model does not need to "know" the documents — it just needs to receive them in context at the time of the query.
Continued pretraining is the better choice when:
- The domain language is so specific that retrieval is not enough (the model does not understand the concepts even when they are provided in context)
- You need low latency without a retrieval step (edge deployment, real-time processes)
- You want the model to *generate* content in the domain language, not just answer questions
RAG and continued pretraining are not mutually exclusive — the best practical results come from combining them: a domain-adapted model with RAG over current documentation. More on the decision between these approaches can be found in the RAG vs fine-tuning comparison.
A practical decision framework
When a client comes to us with the question "should we further-train the model or not", we work through these steps:
1. What exactly is not working? Bad format/tone → SFT; model doesn't understand concepts → continued pretraining; model cites outdated information → RAG; model hallucinates facts → a combination
2. How much unlabelled text do you have? Fewer than 10 million tokens: continued pretraining probably isn't worth it; 100M+ tokens: worth considering
3. What is your compute budget? If you don't have access to A100+ or cloud GPUs, start with SFT; continued pretraining is for teams with established ML infrastructure
4. Is the domain genuinely specialised? If your technical language also appears in ordinary web text, SFT on good examples will be sufficient
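The four questions can be written down as a small function, which makes the order of the checks explicit. This is an illustrative encoding, not a definitive rule set: the symptom labels and return strings are hypothetical, and the "borderline" answer for corpora between 10M and 100M tokens is an assumption, since the framework only names the two ends of that range.

```python
def recommend(symptom, unlabelled_tokens, has_gpu_access, domain_is_specialised):
    """Map the four framework questions to a suggested starting point."""
    direct_answers = {
        "format_or_tone": "SFT",
        "outdated_information": "RAG",
        "hallucinated_facts": "combination",
    }
    if symptom in direct_answers:
        return direct_answers[symptom]
    if symptom != "missing_concepts":
        raise ValueError(f"unknown symptom: {symptom}")

    # The model does not understand the concepts: check corpus, budget, and domain.
    if not domain_is_specialised or not has_gpu_access:
        return "SFT"
    if unlabelled_tokens < 10_000_000:
        return "SFT"
    if unlabelled_tokens >= 100_000_000:
        return "continued pretraining, then SFT"
    return "borderline: pilot continued pretraining or invest in SFT data"


print(recommend("missing_concepts", 300_000_000, True, True))
print(recommend("missing_concepts", 5_000_000, True, True))
print(recommend("format_or_tone", 0, False, False))
```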
Frequently asked questions
Is continued pretraining the same as domain-adaptive pretraining?
Essentially yes — the terms are used interchangeably. "Domain-adaptive pretraining" (DAPT) is the academic term from the research community; "continued pretraining" is more common in industry. Both describe the same thing: continuing pretraining on a domain-specific corpus of unlabelled text.
Can I do continued pretraining with LoRA, or does it require a full parameter update?
LoRA (and QLoRA) work for continued pretraining, and most teams prefer them precisely because of the memory savings. A full parameter update yields slightly better results, but the difference is usually smaller than the cost difference. For most domain adaptations, LoRA is sufficient.
How much text do I need for a meaningful effect?
From practice: below 50 million tokens the effect is mostly marginal. Pronounced domain adaptation starts to show from hundreds of millions of tokens. If you have less, invest instead in the quality of your SFT data — you will likely get a better result at a lower cost.
Will the model lose the ability to follow instructions after continued pretraining?
It can, and this is a common trap. Continued pretraining is typically done on a *base* model, which has no instruction-following behaviour to begin with; if it is done on an instruction-tuned model instead, you risk weakening its built-in behaviours. Either way, SFT (and optionally DPO) must follow continued pretraining as a mandatory phase. Never put a continued-pretrained model directly into production without an instruction fine-tuning layer.
Is continued pretraining suitable for small models (1B–4B)?
Yes, and sometimes it is even more effective. Small models have limited general-knowledge capacity, so the domain "overwriting" is proportionally larger. A fine-tuned 4B model in a narrow domain can outperform a generic model an order of magnitude larger within that domain. More on this topic can be found in the comparison of a small fine-tuned model vs a large base model.
Conclusion
Continued pretraining is not a universal answer to domain LLM problems — but for companies with extensive technical documentation, highly specific professional language, or the need for deep language adaptation, it is a tool that SFT simply cannot replace. The key is knowing when to use it: when the model does not understand the domain at a fundamental level, not merely when it fails to format responses correctly.
*If you are considering whether continued pretraining is right for your use case, or you are looking for a framework to decide between RAG, SFT, and domain adaptation, we are happy to work through it together. MP Industrial Solutions has experience deploying local LLMs in industrial environments — including cases where we recommended stepping back from continued pretraining and choosing a simpler solution instead.*
