When a client presents requirements for an on-premises deployment of a 70B model and has a single server with two GPUs available, the initial reaction is almost always the same: "That won't fit." In FP16 that's true — such a model needs 140–168 GB of VRAM, which is not a commonly available configuration. But in 4-bit quantisation the same model drops to 35–40 GB, which is two mid-range cards. And the quality loss? With the right format it's less than most people expect.
Quantisation is today one of the most important practical skills for deploying LLMs on your own hardware. This article explains what happens to model weights during quantisation, what the difference is between the GGUF, AWQ and GPTQ formats, how much quality you actually lose at each level, and how this differs from distillation — a technique people often confuse with quantisation.
What quantisation does to weights
Modern LLMs are trained in BF16 or FP16 format — each parameter occupies 16 bits. A model with 7 billion parameters therefore needs roughly 14 GB just for the weights themselves (plus KV cache and activations on top of that).
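Quantisation represents the same weights at lower numerical precision. Instead of a 16-bit float you use an 8-bit or 4-bit integer. The formula for the weights alone is straightforward: VRAM (GB) ≈ (number of parameters in billions × bits) / 8. For a 7B model at 4 bits that gives about 3.5 GB of raw weights; once quantisation metadata, layers kept at higher precision and runtime overhead are counted, expect 5–7 GB in practice, which is still enough to fit on a standard workstation.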
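The rule of thumb above is easy to turn into a small helper. The sketch below is only the article's formula in code; the function name `weight_vram_gb` is made up for illustration, and the result covers the weights alone.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


# Print the raw weight size for a few common model sizes and bit widths.
for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    sizes = {bits: weight_vram_gb(params, bits) for bits in (16, 8, 4)}
    print(
        f"{name}: FP16 {sizes[16]:.1f} GB, "
        f"8-bit {sizes[8]:.1f} GB, 4-bit {sizes[4]:.1f} GB"
    )
```

Treat the output as a lower bound: KV cache, activations and quantisation metadata come on top, which is why figures quoted for real deployments are higher.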
The cost of that saving is information loss. Numerical precision falls, and some subtle differences between weights get "flattened" to the same representable value. The result is a slight drop in output quality — the question is how large.
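To make the "flattening" concrete, here is a minimal sketch of round-to-nearest symmetric quantisation with a single shared scale. It is a deliberately simplified scheme chosen for illustration, not the method used by any of the formats discussed later, and the weight values are invented.

```python
def quantize_symmetric(weights, bits=4):
    """Round-to-nearest symmetric quantisation with one shared scale."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    scale = max(abs(w) for w in weights) / qmax
    codes = [round(w / scale) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.70, 0.31, 0.29, -0.12, -0.36]  # invented example values
codes, scale = quantize_symmetric(weights)
restored = dequantize(codes, scale)

# 0.31 and 0.29 both land on code 3, so after dequantisation
# the model can no longer tell them apart.
print(codes)
print([round(w, 3) for w in restored])
```

The two middle weights differ in the original but share one code after quantisation; that collision is exactly the information that is lost.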
It is important to understand what quantisation is not: it is not distillation. Quantisation preserves the architecture and parameter count of the original model and only changes the numerical format of the weights. Distillation, by contrast, trains a separate, smaller model (usually with fewer layers or a different architecture) to reproduce the behaviour of the large one. It is knowledge transfer into a new model, not a change of number format. For more detail on distillation see the article Model distillation: making a small, fast model from a large one.
GGUF — format for CPU and cross-platform deployment
GGUF is a binary format developed for llama.cpp and today also natively supported in Ollama. Its key property: the model can run entirely on CPU, or on a hybrid CPU+GPU combination where some layers run on the GPU and the rest on CPU using system RAM.
For enterprise deployment this means a practical advantage: a developer can run a 13B model on a workstation without a dedicated GPU, or on a server with less VRAM than the FP16 model would require.
Quantisation levels in GGUF
The naming follows the scheme Q<bits>_<variant>. The most important levels:
- Q8_0 — 8-bit quantisation, smallest quality loss (~1–2 % vs FP16). Recommended when you have enough VRAM and want maximum precision.
- Q4_K_M — 4-bit, "medium" variant with adaptive quantisation for sensitive layers. Retains ~92–95 % of FP16 quality. The standard sweet spot for most use-cases.
- Q4_K_S — 4-bit, "small" variant, slightly smaller than K_M at comparable quality.
- Q3_K_M — 3-bit, significantly smaller footprint, noticeable degradation on complex reasoning.
- Q2_K — 2-bit, severe degradation. Usable only under extremely constrained hardware and non-critical tasks.
The letter K in the name stands for "k-quants" — a more sophisticated method that applies higher precision to the layers most sensitive to quantisation errors (typically embedding and output layers) and more aggressive compression to less critical parts. The result is a better quality-to-size ratio compared to simple uniform quantisation.
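The naming scheme can be read mechanically. The following sketch parses only the level names listed in this article; `parse_quant_name` and `QuantLevel` are hypothetical helpers, not part of llama.cpp or Ollama, and other naming families are out of scope.

```python
import re
from typing import NamedTuple, Optional


class QuantLevel(NamedTuple):
    bits: int
    k_quant: bool
    size: Optional[str]  # "S", "M", "L", or None when no size suffix is given


# Q<bits>_<variant>, where the variant is either a digit (Q8_0)
# or K with an optional size suffix (Q2_K, Q4_K_M).
_PATTERN = re.compile(r"^Q(\d+)_(K(?:_([SML]))?|\d)$")


def parse_quant_name(name: str) -> QuantLevel:
    match = _PATTERN.match(name.upper())
    if match is None:
        raise ValueError(f"unrecognised quantisation name: {name!r}")
    bits, variant, size = match.groups()
    return QuantLevel(int(bits), variant.startswith("K"), size)


for level in ("Q8_0", "Q4_K_M", "Q4_K_S", "Q3_K_M", "Q2_K"):
    print(level, parse_quant_name(level))
```

A helper like this is mainly useful for sorting or filtering a list of downloaded model files by bit width.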
AWQ — calibrated quantisation for the GPU stack
AWQ (Activation-aware Weight Quantization) is a method that takes activation statistics into account during quantisation, that is, the values that actually flow through the model when it processes a calibration dataset. Based on those statistics it identifies the small share of "important" weight channels (those multiplied by large activations) and protects them by rescaling them before quantisation, so that they lose less precision than the rest.
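The idea of protecting a salient weight can be shown on a toy dot product. The sketch below is a heavily simplified illustration under stated assumptions: one weight group, invented numbers, a hand-picked scaling factor `s`, and a plain round-to-nearest quantiser. The real AWQ procedure searches for per-channel scales on calibration data.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantisation with one scale per group, then dequantise."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    return [round(w / scale) * scale for w in weights]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.9, 0.2, -0.5, 0.3]      # invented values
activations = [0.1, 10.0, 0.2, 0.1]  # channel 1 sees a large activation
exact = dot(weights, activations)

# Plain quantisation: the rounding error on weight 1 is amplified by 10.0.
err_plain = abs(dot(quantize_group(weights), activations) - exact)

# Activation-aware variant: scale the salient weight up by s before
# quantising and scale its activation down by s, so the product is unchanged
# mathematically but the weight's rounding error shrinks.
s, salient = 2.0, 1
scaled_w = list(weights)
scaled_w[salient] *= s
scaled_x = list(activations)
scaled_x[salient] /= s
err_protected = abs(dot(quantize_group(scaled_w), scaled_x) - exact)

print(f"error without scaling: {err_plain:.3f}")
print(f"error with scaling:    {err_protected:.3f}")
```

In this example the scaled weight stays below the group maximum, so the quantisation step is unchanged and only the salient channel benefits; choosing `s` too large would raise the step size for the whole group, which is why the scale has to be tuned.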
The result: AWQ 4-bit retains ~93–96 % of FP16 quality, slightly better than GPTQ at the same model size. The difference is most visible in long text and code generations, where small quantisation errors accumulate and coherence degrades faster.
AWQ requires a GPU at inference time — it is not intended for CPU-only environments. It is natively supported by vLLM and TGI. For production NVIDIA deployments it is one of the preferred formats today.
GPTQ — maximum throughput on NVIDIA GPUs
GPTQ (Generative Pre-trained Transformer Quantization) is an older and well-established method. It uses second-order information (Hessians) computed on a calibration dataset to minimise the quantisation error layer by layer. In practice, GPTQ 4-bit is slightly behind AWQ in quality retention (~90–93 % of FP16), but delivers maximum throughput on NVIDIA GPUs via fused kernels.
For scenarios where the priority is the highest possible tokens per second (for example, serving multiple concurrent users), GPTQ combined with vLLM produces strong results. Like AWQ, it requires a GPU — CPU inference is not natively supported.
Format comparison — when to use what
- GGUF: when you need cross-platform deployment, CPU inference or hybrid CPU+GPU mode, or when you are deploying via Ollama on developer workstations
- AWQ: pure GPU stack (NVIDIA), production deployment via vLLM or TGI, output coherence is the priority
- GPTQ: pure GPU stack (NVIDIA), production deployment via vLLM, maximum throughput for multi-user serving is the priority
These formats are not interchangeable or mutually compatible — a model quantised in GPTQ cannot be loaded directly in llama.cpp, and vice versa. Before selecting a format you need to know what serving stack and hardware you plan to use.
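The selection rules from the list above can be written down as a small decision function. This is a sketch that mirrors only this article's recommendations; `pick_format` and its parameter names are hypothetical, and real projects will have more factors to weigh.

```python
def pick_format(serving_stack: str, needs_cpu: bool = False,
                priority: str = "quality") -> str:
    """Return the format this article suggests for a given setup.

    serving_stack: "ollama", "llama.cpp", "vllm" or "tgi"
    needs_cpu:     True for CPU-only or hybrid CPU+GPU inference
    priority:      "quality" (output coherence) or "throughput"
    """
    stack = serving_stack.lower()
    if needs_cpu or stack in {"ollama", "llama.cpp"}:
        return "GGUF"
    if stack not in {"vllm", "tgi"}:
        raise ValueError(f"no recommendation for stack {serving_stack!r}")
    # The article pairs GPTQ with vLLM when throughput matters most.
    if priority == "throughput" and stack == "vllm":
        return "GPTQ"
    return "AWQ"


print(pick_format("ollama"))                       # developer workstation
print(pick_format("vllm"))                         # production, coherence first
print(pick_format("vllm", priority="throughput"))  # multi-user serving
```

Because the formats are not interchangeable, it helps to make this decision once, early, and record the reasoning alongside the hardware plan.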
Real quality loss — what the numbers say
The question clients ask most often: "How much quality will we lose?"
The answer depends on the quantisation level and on the type of task. Measurements across various models (Qwen, DeepSeek, Mistral, Llama) show a consistent pattern:
- Q8_0 / 8-bit: perplexity typically within 1–2 % of FP16. Practically indistinguishable in everyday conversation.
- Q4_K_M / AWQ 4-bit: perplexity typically within 5–8 % of FP16. For most tasks (information extraction, summarisation, classification) the difference is not visible to the naked eye. On complex multi-step reasoning (maths, code, long chains of steps) a slight drop may be observable.
- Q3 and below: degradation is noticeable. The model begins producing less coherent outputs, especially during long generations or tasks where precision matters.
- Q2: severe degradation. Only suitable under extreme hardware constraints and for non-critical tasks.
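For readers who want to reproduce this kind of number, perplexity is the exponential of the average negative log-probability per token. The sketch below shows the arithmetic only; the log-probabilities are invented, and in a real measurement they would come from scoring the same evaluation text with both model variants.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_increase_pct(ppl_reference, ppl_quantised):
    """How much higher the quantised perplexity is, in percent."""
    return (ppl_quantised - ppl_reference) / ppl_reference * 100


# Invented per-token log-probabilities for the same four tokens.
fp16_logprobs = [-1.20, -0.80, -2.10, -0.50]
q4_logprobs = [-1.25, -0.85, -2.15, -0.55]

ppl_fp16 = perplexity(fp16_logprobs)
ppl_q4 = perplexity(q4_logprobs)
print(f"FP16 perplexity: {ppl_fp16:.3f}")
print(f"Q4 perplexity:   {ppl_q4:.3f}")
print(f"increase:        {relative_increase_pct(ppl_fp16, ppl_q4):.1f} %")
```

Lower perplexity is better, so a positive percentage means the quantised model is slightly less certain about the reference text.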
An important nuance: quality loss is not uniformly distributed. Models with more parameters tolerate more aggressive quantisation better — a 70B model in Q4_K_M will generally retain more capability than a 7B model in the same configuration. For small models (7B and below) the difference between Q8 and Q4 is more visible.
A second nuance: benchmarks measure averages. For a specific domain use-case (for example, analysing technical documentation), it is worth running your own comparison of FP16 vs Q4 on a sample of real inputs — a few dozen examples is usually enough to reach an indicative conclusion.
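Such a comparison does not need heavy tooling. A minimal sketch, assuming you have already collected the answers of both model variants for the same inputs as plain strings: the function names are hypothetical, the sample data is invented, and exact string match is only one possible criterion.

```python
def agreement_rate(reference_outputs, candidate_outputs, normalise=str.strip):
    """Share of inputs where the quantised model's answer matches the reference."""
    if len(reference_outputs) != len(candidate_outputs):
        raise ValueError("both lists must cover the same inputs")
    if not reference_outputs:
        raise ValueError("need at least one example")
    matches = sum(
        normalise(ref) == normalise(cand)
        for ref, cand in zip(reference_outputs, candidate_outputs)
    )
    return matches / len(reference_outputs)


def disagreements(reference_outputs, candidate_outputs, normalise=str.strip):
    """Indices of inputs worth inspecting by hand."""
    return [
        i
        for i, (ref, cand) in enumerate(zip(reference_outputs, candidate_outputs))
        if normalise(ref) != normalise(cand)
    ]


# Invented extraction results for four inputs.
fp16_answers = ["42.5 kPa", "17 mm", "230 V", "3.2 A"]
q4_answers = ["42.5 kPa", "17 mm ", "280 V", "3.2 A"]

print(agreement_rate(fp16_answers, q4_answers))
print(disagreements(fp16_answers, q4_answers))
```

Reading the disagreeing cases by hand usually tells you more than the rate itself, because it shows which kind of input the quantised model struggles with.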
VRAM savings in practice
Concrete numbers for the most common model sizes:
- 7–9B model: FP16 ~16–19 GB → Q8_0 ~8–13 GB → Q4_K_M ~5–7 GB
- 13B model: FP16 ~26 GB → Q8_0 ~14 GB → Q4_K_M ~8 GB
- 27–34B model: FP16 ~54–68 GB → Q8_0 ~30–34 GB → Q4_K_M ~17–20 GB
- 70B model: FP16 ~140–168 GB → Q8_0 ~70–75 GB → Q4_K_M ~35–40 GB
On top of these figures, KV cache grows with context length and the number of concurrent requests. A 70B model with a long context can consume an additional 20–40 GB of KV cache across a handful of parallel conversations. When planning infrastructure, therefore, do not account only for the weights — KV cache is an equally important variable. More on this in the article Which GPU for LLM inference.
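The KV cache figure can be estimated from the model's shape: for every token the cache stores one key and one value vector per layer and per KV head. The sketch below assumes an FP16 cache and a Llama-style 70B layout with grouped-query attention (80 layers, 8 KV heads, head dimension 128); these architecture values are assumptions for illustration, so check the configuration of your actual model.

```python
def kv_cache_gib(layers, kv_heads, head_dim, context_tokens, sequences,
                 bytes_per_value=2):
    """Approximate KV cache size in GiB (2^30 bytes).

    The factor 2 accounts for storing both keys and values.
    bytes_per_value=2 corresponds to an FP16 or BF16 cache.
    """
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * sequences / 2**30


# Assumed Llama-style 70B shape: 80 layers, 8 KV heads, head dimension 128.
for context in (8_192, 32_768):
    size = kv_cache_gib(80, 8, 128, context_tokens=context, sequences=4)
    print(f"{context} tokens x 4 sequences: {size:.1f} GiB")
```

Under these assumptions four parallel conversations need about 10 GiB at 8k context and about 40 GiB at 32k. A model without grouped-query attention has more KV heads and would need proportionally more.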
Quantisation and the serving stack — what vLLM and Ollama have to say about it
Ollama is an excellent tool for developer deployments — you download a GGUF model with a single command and it runs locally without configuration. Under the hood it uses llama.cpp. The key limitation: Ollama is not a production serving framework. With multiple concurrent users you may see significantly lower throughput compared to vLLM — regardless of quantisation.
vLLM with AWQ or GPTQ models is designed for production environments with multiple concurrent requests. It uses PagedAttention for efficient KV cache management and continuous batching — the resulting throughput with multiple concurrent users can be substantially higher than with Ollama. The trade-off is a more complex setup and a GPU requirement.
For enterprise deployments a typical arrangement is: developers and testers work with GGUF models via Ollama, while the production inference server runs on vLLM with AWQ or GPTQ. Both approaches coexist — it is not either/or. A detailed comparison of these serving tools is in the article vLLM vs SGLang vs Ollama.
Quantisation vs distillation — an important distinction
These two terms are mixed up in practice, but they are fundamentally different techniques.
Quantisation preserves the original model architecture — you only change the numerical format of the weights. The process is fast, requires no training, and existing quantisation tools can handle a 70B model in a few hours on a standard server. What is lost is information at the level of numerical precision.
Distillation creates a new, smaller model by training it on the outputs of a larger (teacher) model. It is a full training process — it requires data, compute time, and hyperparameter tuning. The result is a model with fewer parameters and a different architecture that has "learned" to mimic the teacher. What is lost is capacity — a smaller model simply does not have enough parameters for some complex tasks.
In practice both approaches are complementary: you can quantise a distilled model afterwards. For example, a small model created by distilling a frontier model can be distributed as a GGUF Q4 for edge deployment.
How to approach quantisation selection in a project
Several practical recommendations from deployments we have seen:
1. Start with Q4_K_M: for most enterprise use-cases (extraction, classification, Q&A over documents) Q4_K_M delivers sufficient quality at a reasonable VRAM footprint
2. Validate on your own data: if the application has specific requirements (e.g. precise extraction of numerical values from technical reports), compare FP16 and Q4 on a sample of 50–100 real inputs before making a final decision
3. Do not underestimate KV cache: when planning hardware, account for KV cache on top of the weight size, especially if you plan long contexts
4. Factor in the serving stack: if you are going with vLLM, AWQ is a good choice; if you are using Ollama or need cross-platform, GGUF; if you are prioritising throughput on NVIDIA, GPTQ
5. Avoid Q2: degradation is severe; better to use a smaller model in Q4 than a larger model in Q2
Frequently asked questions
Can I quantise any model myself?
Yes, tools such as llama.cpp (for GGUF), AutoAWQ and AutoGPTQ are open-source and freely available. For common sizes (7–34B) even a standard server can handle quantisation within a few hours. For 70B and larger models you need significantly more RAM, because the full FP16 weights have to be loaded before conversion. In practice, for most teams it is simpler to reach for pre-quantised models from Hugging Face: popular uploads have typically been exercised by many users, and you save time. Still check the result on your own data.
Is the difference between Q4_K_M and Q4_K_S noticeable?
Minimal in everyday use. Q4_K_S is a somewhat smaller file at very similar quality. Q4_K_M is the more conservative choice, retaining slightly more precision in sensitive layers. For most use-cases we recommend Q4_K_M as the starting point.
Quantisation and GDPR — is there anything to consider?
Not directly: quantisation does not change where data flows or how it is stored, only the model's VRAM footprint, speed and, slightly, output quality. GDPR implications relate to *where* the model runs and *who* has access to the data. On-premises deployment of a quantised model can help with GDPR from a data locality perspective (data does not leave your infrastructure), but compliance requires much more: audit logs, access controls, a documented DPA process. More in the article On-prem LLM for regulated industries.
Can I fine-tune a quantised model further?
Direct fine-tuning of a quantised model (e.g. GGUF Q4) is not a standard workflow — most fine-tuning frameworks work with FP16 or BF16 weights. QLoRA is the exception: it allows training with 4-bit quantised base weights while the LoRA adapters are trained at higher precision. Details in the article LoRA vs QLoRA vs full fine-tuning.
What is the difference between NVFP4 and standard 4-bit quantisation?
NVFP4 is a hardware-native format specific to the latest NVIDIA GPUs with Blackwell architecture. Unlike software 4-bit methods (GPTQ, AWQ, GGUF Q4), NVFP4 is directly accelerated in the chip's tensor cores. The result is higher throughput compared to generic 4-bit formats on the same hardware. If you do not have Blackwell GPUs, this format does not apply to you — standard AWQ or GPTQ is the right choice.
*Choosing the right quantisation format and serving stack is a decision that significantly affects the cost and performance of a production deployment. At MP Industrial Solutions we help companies move from a first experiment with a local LLM to a production infrastructure — including hardware assessment, model selection, and choosing the right quantisation format for a specific use-case. If you are considering an on-prem LLM deployment, we are happy to schedule an initial consultation.*
