More and more companies are running into the same situation: a frontier model works brilliantly, but in production it is too expensive or too slow. A 3–5 second latency is unacceptable for inline recommendations inside an MES. The cost of thousands of calls to a large API model adds up month after month to figures a controller refuses to approve. And deploying on an edge device with limited VRAM simply isn't on the table.
This is exactly where model distillation (knowledge distillation) enters the picture. It is not a new technique — it emerged in the context of classification networks more than a decade ago — but in the era of large language models it is experiencing a renaissance and has become one of the key tools for production deployment. This article explains how distillation works, where it differs from quantisation, when it pays off, and what to realistically expect from it.
What distillation is — and what it is not
Distillation is knowledge transfer from one model (the teacher) to another, smaller one (the student). The teacher was trained for a long time on a lot of data — it has developed internal representations and capabilities that you cannot extract directly from training labels alone. The student learns not only from final answers but also from *how* the teacher reasons.
An important distinction that is frequently confused in practice:
Distillation ≠ quantisation. Quantisation is a compression technique: you store the original weights at lower numerical precision (for example, going from FP16 to 4-bit integers, as in the quantised GGUF files used by llama.cpp). The model stays the same; it takes up less space and usually runs faster, typically losing ~1–3 % of quality on benchmarks. Quantisation changes neither the architecture nor the parameter count.
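To make the "same model, lower precision" point concrete, here is a minimal sketch of symmetric 4-bit quantisation in plain Python. It is a toy round-to-nearest scheme with a single scale for the whole tensor, not the block-wise schemes real tools use; the function names and the example weights are illustrative assumptions.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest quantisation to the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int4(codes, scale):
    """Map the integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -0.05, 0.31]
codes, scale = quantize_int4(weights)
restored = dequantize_int4(codes, scale)

# The parameter count is unchanged; only the precision per parameter drops.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("codes:", codes)
print("parameters before/after:", len(weights), len(restored))
print("max rounding error:", max_error)
```

The number of weights before and after is identical. The only thing lost is precision, bounded here by half a quantisation step.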
Distillation changes both. The student is a different model with fewer parameters and potentially a different architecture. The goal is not to compress the teacher but to transfer its capabilities into a smaller structure.
Distillation ≠ synthetic data. When you use a frontier model to generate training examples for a smaller model, you are creating synthetic data — not performing classical distillation in the technical sense. In practice these approaches are often combined, but the mechanism is different — covered in more depth in the article on synthetic data for fine-tuning.
Two fundamental types of distillation
Response-based distillation (output-level)
The simplest approach. The student trains on soft labels — the full probability distributions the teacher produces at its output (logits or softmax distributions), not just the hard "correct/incorrect" answer.
Why are soft labels more valuable than hard ones? When the teacher sees a question about diagnosing a technical problem, its output distribution might say: "60 % probability A, 25 % probability B, 15 % probability C." That reflects uncertainty and the relationship between options. A hard label would simply be "A." Training on soft labels gives the student a denser learning signal.
In practice for LLMs this means the student observes, token by token, how the teacher distributes probabilities, and tries to imitate those distributions — not just reproduce the final text.
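A small sketch of why the soft signal is denser, using the 60/25/15 distribution from the example above. The two student distributions are hypothetical; both assign the same probability to A, so a hard label cannot tell them apart.

```python
import math


def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


teacher_soft = [0.60, 0.25, 0.15]  # teacher distribution over A, B, C
hard_label = [1.0, 0.0, 0.0]       # only "A" survives as a hard label

# Two hypothetical students that both give A a probability of 0.60.
student_1 = [0.60, 0.25, 0.15]     # agrees with the teacher on B and C
student_2 = [0.60, 0.02, 0.38]     # gets the relationship between B and C wrong

hard_1 = cross_entropy(hard_label, student_1)
hard_2 = cross_entropy(hard_label, student_2)
soft_1 = cross_entropy(teacher_soft, student_1)
soft_2 = cross_entropy(teacher_soft, student_2)

print("hard-label loss:", hard_1, hard_2)
print("soft-label loss:", soft_1, soft_2)
```

The hard-label loss is the same for both students, while the soft-label loss penalises the second student for misjudging the alternatives. For an LLM the same comparison is made at every token position over the whole vocabulary.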
Feature-based distillation (internal-representation level)
A more sophisticated approach. The student tries to reproduce not only the outputs but also the internal states of the teacher — hidden-layer activations, attention patterns, representations in the embedding space.
Advantage: it transfers a deeper structure of knowledge. Disadvantage: it requires the teacher and student to have a sufficiently compatible architecture, which complicates implementation when the size difference is large. In practice, feature-based distillation is most commonly used when training models of similar architecture where the teacher is 2–4× larger.
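One common way to handle mismatched hidden sizes is to pass the student's activation through a learned linear projection and penalise the squared distance to the teacher's activation. The sketch below shows that loss for a single hidden state; the dimensions, values, and the projection matrix are hypothetical, and in real training the projection is learned together with the student.

```python
def project(hidden, weight):
    """Linear map from the student's hidden size to the teacher's.

    weight has one row per teacher dimension.
    """
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def feature_loss(student_hidden, teacher_hidden, weight):
    """Mean squared error between the projected student state and the teacher state."""
    projected = project(student_hidden, weight)
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(teacher_hidden)


# Hypothetical activations: a 2-dim student state and a 4-dim teacher state.
student_hidden = [0.5, -1.0]
teacher_hidden = [0.2, 0.4, -0.6, 1.0]
# Hypothetical projection matrix (4 x 2); learned during real training.
weight = [[0.1, 0.0], [0.0, -0.3], [0.2, 0.5], [1.0, -0.5]]

loss = feature_loss(student_hidden, teacher_hidden, weight)
print("feature loss:", loss)
```

The projection is the part that becomes harder to train as the gap between teacher and student grows, which is one reason this approach is usually applied to architecturally similar pairs.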
Modern libraries combine both approaches. The standard training objective for distillation in TRL or Axolotl typically includes a combination of logit loss (KL divergence between teacher and student distributions) and ground-truth label loss (standard cross-entropy).
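The combined objective can be sketched for a single token position as follows. This is a generic formulation in plain Python, not the code of any particular library: the weighting `alpha`, the `temperature`, and the multiplication of the KL term by the squared temperature are conventional choices assumed here for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=2.0):
    """alpha * KL(teacher || student) * T^2 + (1 - alpha) * cross-entropy."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    kd_term = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    ce_term = -math.log(softmax(student_logits)[label])
    return alpha * kd_term + (1 - alpha) * ce_term


# Hypothetical logits over a 3-token vocabulary; token 0 is the ground truth.
teacher_logits = [2.0, 1.0, 0.5]
student_logits = [1.2, 1.1, 0.3]
print(distillation_loss(student_logits, teacher_logits, label=0))
```

In sequence training this value would be averaged over all token positions. Setting `alpha` to 0 reduces it to plain cross-entropy, which is what training on teacher-generated text alone amounts to.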
When distillation is worth it
Distillation is not the right fit for every scenario. We have seen projects where choosing the right approach upfront saved months of work. Three situations where distillation clearly wins:
Latency and edge deployment. If the model must run locally on a device with 4–8 GB VRAM — an industrial terminal, an embedded controller, a mobile application — a frontier model is simply not an option. A well-distilled model in the 1B–4B range can, on a narrow domain, deliver results that are entirely sufficient for the use case. Example: a language model for classifying error messages from SCADA systems does not need the general knowledge of a 70B model, but it must be fast and accurate on that specific domain.
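A rough way to sanity-check whether a model size fits a VRAM budget is to multiply the parameter count by the bytes per parameter. The helper below is a back-of-the-envelope sketch that counts weights only; KV cache, activations, and runtime overhead come on top and are deliberately left out.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for the weights alone, in GB (10**9 bytes)."""
    return params_billions * bits_per_param / 8


for params in (1, 4, 70):
    for bits in (16, 4):
        size = weight_memory_gb(params, bits)
        print(f"{params}B parameters at {bits}-bit: {size:.1f} GB")
```

Because the estimate ignores everything except weights, treat it as a lower bound and leave headroom when comparing it against the device's VRAM.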
Cost at high call volumes. If your application calls an LLM thousands or tens of thousands of times per day, the price difference between calling a frontier API and running inference on your own 7B model is an order of magnitude. Distilling from an expensive frontier teacher into a cheaply-served student is the standard production pattern here.
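The comparison is simple arithmetic once you plug in your own numbers. The sketch below uses placeholder values throughout: the call volume, tokens per call, API price, and GPU hourly cost are hypothetical, not quotes from any provider, and the self-hosted side ignores engineering and maintenance effort.

```python
def monthly_api_cost(calls_per_day, tokens_per_call,
                     price_per_million_tokens, days=30):
    """Cost of an API billed per token, in the same unit as the price."""
    tokens = calls_per_day * tokens_per_call * days
    return tokens / 1_000_000 * price_per_million_tokens


def monthly_self_hosted_cost(gpu_hourly_cost, gpus=1, hours_per_day=24, days=30):
    """Cost of running your own GPU(s) around the clock."""
    return gpu_hourly_cost * gpus * hours_per_day * days


# All inputs below are hypothetical placeholders; substitute your own figures.
api = monthly_api_cost(calls_per_day=20_000, tokens_per_call=1_500,
                       price_per_million_tokens=10.0)
local = monthly_self_hosted_cost(gpu_hourly_cost=1.0)

print("API per month:", api)
print("self-hosted per month:", local)
print("ratio:", api / local)
```

The ratio depends entirely on the inputs, so it is worth rerunning with measured token counts before drawing conclusions.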
Regulated or air-gapped environments. Data that cannot leave your perimeter requires a local model. If your domain expert is a frontier model with a cloud API (for example for annotating training data), distillation transfers its knowledge into a model you can deploy on-premises. More on the requirements of regulated environments in the article On-prem LLM for regulated industries.
When distillation is not enough: if your use case demands general-purpose reasoning, complex multi-step reasoning, or very long context windows, a small student will not compete with a large model, however good the distillation is. Distillation transfers capabilities, but it cannot give the student more capacity than its smaller parameter count allows.
Realistic quality expectations
This is where we see the biggest gap between marketing claims and production reality.
What distillation realistically achieves:
A well-distilled student on a narrow domain (technical documentation, classification, structured data extraction) can reach 85–95 % of the teacher's quality on that specific domain, at 5–20× smaller size. In early 2025 DeepSeek released a series of distilled models (including versions in the 1.5B–8B range) from their larger reasoning model, successfully transferring chain-of-thought reasoning into significantly smaller architectures while preserving most of the performance on mathematical and coding tasks.
What distillation will not preserve:
The teacher's general capabilities transfer poorly. A student distilled on technical documentation will perform worse at writing marketing copy or resolving ethical dilemmas. This is a feature, not a bug — specialisation is the intent — but you need to be aware of it when designing the system.
Long context and complex reasoning are another area where small students lose ground. A teacher with a 1M-token context window transfers only a fraction of that capability into a student with a 128K context window and fewer parameters.
Practical rule of thumb: you can distil such that the student is substantially better than a base model of that size — but you cannot distil such that the student is as good as the teacher in general. The goal is targeted excellence, not general equivalence.
Relationship to fine-tuning and synthetic data
Distillation, fine-tuning, and synthetic data are complementary techniques, not alternatives. A typical production pipeline looks like this:
1. Teacher generates training data: a frontier model annotates, answers, and evaluates on your domain. This is a combination of distillation (the teacher produces logits or soft labels) and synthetic data generation (the teacher generates texts that become training examples).
2. Student trains on this data: via standard SFT (Supervised Fine-Tuning) or with an explicit distillation loss function where the student imitates the teacher's distributions.
3. Optionally, alignment: DPO or GRPO on top of the distilled student, if you need to further tune its behaviour to match preferences.
An important detail: if the teacher generates answers and the student trains only on the final texts (without access to logits), this is technically training on synthetic data, not distillation in the strict sense. Results can be similar, but the mechanism is different. Classical distillation with logits typically transfers a richer signal.
When building a distillation dataset the same principles apply as for fine-tuning in general — covered in detail in the article Fine-tuning dataset — how much and what quality.
Practical steps toward your own distilled model
If you want to try distillation in practice, the following pipeline works for most domain use cases:
Step 1 — Define the domain and task. The narrower the domain, the better the student will learn. "Classification of Fanuc CNC machine error codes" is a better scope than "industrial documentation."
Step 2 — Prepare seed data. Roughly 150–300 manually verified examples (question/answer, input/output) from your domain. This is the foundation of quality — garbage in, garbage out applies doubly here.
Step 3 — Teacher generates expanded data. Run a frontier model on your seed examples, let it generate variations, answer related questions, and produce chain-of-thought explanations. The target volume for a working SFT is typically in the thousands of examples.
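A minimal sketch of the expansion step, assuming the teacher is wrapped in a plain function that takes a prompt and returns text. `stub_teacher`, the prompt wording, the record fields, and the seed questions are all hypothetical stand-ins; in a real pipeline the stub would be replaced by a call to your chosen model, and the generated records would be reviewed before training.

```python
import json


def expand_seed(seed_examples, teacher, variations_per_seed=3):
    """Ask the teacher for question variations and step-by-step answers."""
    records = []
    seen = set()
    for seed in seed_examples:
        for i in range(variations_per_seed):
            question = teacher(f"Rewrite as variation {i + 1}: {seed['question']}")
            if question in seen:  # skip exact duplicates
                continue
            seen.add(question)
            answer = teacher(f"Answer step by step: {question}")
            records.append({"seed_id": seed["id"],
                            "question": question,
                            "answer": answer})
    return records


def stub_teacher(prompt):
    """Hypothetical stand-in for a real frontier-model call."""
    return f"[teacher output for: {prompt}]"


# Hypothetical seed examples; real ones come from your verified domain data.
seeds = [
    {"id": 1, "question": "What does error code E-101 mean?"},
    {"id": 2, "question": "How do I reset the axis after error E-205?"},
]

records = expand_seed(seeds, stub_teacher)
jsonl = "\n".join(json.dumps(r) for r in records)  # one training example per line
print(len(records), "records")
```

Keeping the `seed_id` on every record makes it possible to trace a bad generated example back to the seed that produced it.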
Step 4 — Train the student. For most domain cases, standard SFT with Unsloth or Axolotl on a 1B–8B model is sufficient. If you have access to the teacher's logits (open model), add a distillation loss (KL divergence) — TRL has direct support for this. For production pipelines and method selection I also recommend reviewing SFT, DPO, GRPO — which method when.
Step 5 — Evaluate and compare. Measure the student on a holdout set from your domain, and compare it with the teacher and with the base model without distillation. What you care about is the delta: how much of the gap between the base model and the teacher the student has closed. As a rule of thumb, if the student's remaining gap to the teacher is under 5–10 % relative, the distillation was successful.
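The comparison in Step 5 can be expressed as two small ratios. The sketch below assumes a single higher-is-better metric on the holdout set; the three scores are hypothetical.

```python
def gap_closed(base, student, teacher):
    """Fraction of the base-to-teacher gap that the student has closed."""
    if teacher == base:
        raise ValueError("teacher and base scores are identical; no gap to close")
    return (student - base) / (teacher - base)


def remaining_gap(student, teacher):
    """Relative shortfall of the student compared with the teacher."""
    return (teacher - student) / teacher


# Hypothetical holdout accuracies.
base_score, student_score, teacher_score = 0.62, 0.86, 0.92

closed = gap_closed(base_score, student_score, teacher_score)
remaining = remaining_gap(student_score, teacher_score)
print(f"gap closed: {closed:.0%}, remaining gap to teacher: {remaining:.1%}")
```

Reporting both numbers avoids a common trap: a student can sit close to the teacher simply because the base model was already strong, in which case the distillation itself contributed little.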
Step 6 — Deploy. You can quantise the distilled student (for example into .gguf format via llama.cpp) to further reduce memory requirements at inference time. vLLM or Ollama handle serving even for small teams without DevOps infrastructure.
Common mistakes
Student that is too large. If you need a model that fits on an 8 GB GPU, don't start with a 13B student. Distillation doesn't override physics — smaller hardware requires a smaller model.
Teacher and student from incompatible domains. A teacher trained exclusively on English code will be a poor teacher for Slovak customer service. The teacher must be competent on your target domain — otherwise you are distilling the wrong behaviour.
Ignoring scores on other tasks. Distillation can degrade the student's capabilities on tasks outside the training distribution. If your student also handles tasks beyond the distilled domain, measure those too. Catastrophic forgetting is real in distillation — not just in fine-tuning.
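A simple guard is to score the base model and the distilled student on the same set of tasks and flag anything that dropped by more than a tolerance. The task names, scores, and the 0.02 tolerance below are hypothetical, and the sketch assumes both dictionaries contain the same tasks.

```python
def find_regressions(base_scores, student_scores, tolerance=0.02):
    """Return tasks where the student scores worse than the base model by more than tolerance."""
    regressions = {}
    for task, base in base_scores.items():
        drop = base - student_scores[task]
        if drop > tolerance:
            regressions[task] = round(drop, 4)
    return regressions


# Hypothetical scores on higher-is-better metrics.
base_scores = {"domain_classification": 0.62, "summarisation": 0.71, "general_qa": 0.68}
student_scores = {"domain_classification": 0.86, "summarisation": 0.70, "general_qa": 0.55}

regressions = find_regressions(base_scores, student_scores)
print(regressions)
```

Running this check after every training run makes forgetting visible early, before the student is deployed on tasks it no longer handles well.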
Expecting a small student to match a large one across the board. The most common misconception. Distillation is an optimisation for a specific slice of capabilities, not a cloning of the teacher.
Frequently asked questions
Is distillation the same as quantisation?
No. Quantisation compresses an existing model by reducing the numerical precision of its weights — the model stays the same, it just takes up less space. Distillation creates an entirely new, smaller model that is trained to imitate the behaviour of the larger one. Both approaches are commonly combined: you distil first, then quantise the resulting student model.
How much data do I need for distillation?
It depends on the use case and on whether you are using the teacher's logits or only its outputs (synthetic data). For narrow-domain distillation via SFT, useful results are achievable with thousands of examples — provided they are high quality. For a robust production model with no regressions, count on tens of thousands of examples. Seed data of 150–300 manually verified examples is enough; the teacher can generate the rest.
Can I distil from a closed API model whose logits I cannot access?
Yes, but this is incomplete distillation — or more precisely, training on synthetic data. The frontier model generates answers, and you train the student on those texts via standard SFT. Results can be good for most domain tasks, but you will not get the richer signal from soft labels. Check the terms of service of the specific provider before you start — some explicitly prohibit training on their outputs.
When is it better to distil and when to fine-tune directly?
If you already have a high-quality base model of the required size (for example Phi-4, Qwen3 4B, Gemma 3 4B) and you have quality domain data, direct fine-tuning is simpler and faster. Distillation adds value when the teacher has capabilities that your existing data does not capture — for example complex reasoning, long chain-of-thought, or nuanced uncertainty in its distributions.
What hardware do I need to train the student?
The same as for standard LoRA or QLoRA fine-tuning of a model of that size. A 1B–3B student trains comfortably on an RTX 3060 12 GB or a higher-end card. A 7B–8B student with QLoRA runs on an RTX 3090/4090. Training is typically far shorter than full pretraining — on the order of hours, not days.
*At MP Industrial Solutions we help companies move from a promising pilot to robust production deployment — including choosing the right model and technique. If you are weighing whether distillation, direct fine-tuning, or a combination with RAG is right for your use case, we are happy to assess your specific situation.*
