When a client asks "which GPU for a local LLM," they usually have one question in mind: will the model I want to use actually fit? The answer depends on three numbers — model weight size, KV cache size at the target context length, and framework overhead. All three can be calculated upfront. Yet most hardware decisions are made by guesswork, and the result is either a needlessly expensive server or one that falls short.
This article gives concrete numbers: how much VRAM a 7B, 13B, 34B, and 70B model actually consumes at various quantization formats, what fits on a 24 GB / 48 GB / 80 GB card, and when multi-GPU makes sense.
The basic formula: model weights
The simplest VRAM estimate for weights alone:
VRAM (GB) ≈ (number of parameters in billions × number of bits) / 8
Examples:
- 7B model, FP16 (16 bits): 7 × 16 / 8 = 14 GB (in practice ~16–18 GB with overhead)
- 7B model, Q4_K_M (4 bits): 7 × 4 / 8 = 3.5 GB (in practice around 5–7 GB with overhead)
- 70B model, FP16: 70 × 16 / 8 = 140 GB
- 70B model, Q4_K_M: 70 × 4 / 8 = 35 GB (in practice around 38–40 GB)
The weights formula is just the baseline. On top of it you always add KV cache, serving framework overhead, and in the case of quantized models, dequantization buffers as well.
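The baseline formula is simple enough to script. The sketch below is a minimal illustration, not a sizing tool: the function name `weight_vram_gb` is hypothetical, and it treats Q4_K_M as a nominal 4 bits per weight, exactly as the examples above do. It returns the weights-only figure, before any KV cache or overhead.

```python
def weight_vram_gb(params_billions: float, bits_per_weight: float) -> float:
    """Weights-only baseline: parameters (in billions) x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


# Same cases as the worked examples in the text.
for name, params, bits in [
    ("7B FP16", 7, 16),
    ("7B Q4_K_M", 7, 4),    # nominal 4 bits, an approximation
    ("70B FP16", 70, 16),
    ("70B Q4_K_M", 70, 4),  # nominal 4 bits, an approximation
]:
    print(f"{name}: {weight_vram_gb(params, bits):.1f} GB (weights only)")
```

The gap between these baseline values and the "in practice" figures is what the rest of the article accounts for.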
What is KV cache and why it matters
During inference, the model generates tokens one by one. To avoid recomputing the entire sequence for each new token, it stores intermediate results — the so-called key-value pairs for each attention layer. These intermediate results form the KV cache.
KV cache grows linearly with sequence length. For production deployments with concurrent requests it quickly becomes as significant a constraint as the weights themselves.
Indicative KV cache sizes for common models:
- 7B model, 8K context, 1 concurrent request: ~1–2 GB
- 7B model, 32K context, 4 concurrent requests: ~8–16 GB
- 70B model, 32K context, 1 request: ~8–12 GB
- 70B model, 128K context, 1 request: ~40 GB
The exact numbers depend on the number of attention layers, heads, and groups (modern models use Grouped Query Attention, GQA, which dramatically reduces KV cache compared to older multi-head attention). Each model has a different multiplier — check the model's configuration file (config.json in the HuggingFace repository) before selecting hardware.
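The multiplier can be estimated directly from the fields in config.json. The sketch below assumes a Llama-style configuration (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`), an FP16 cache (2 bytes per value), and a head dimension of `hidden_size / num_attention_heads`; some models define the head dimension explicitly, so treat this as an approximation. The function name `kv_cache_gib` and the example values (typical of a 70B-class model with GQA) are illustrative assumptions, not figures from any specific deployment.

```python
def kv_cache_gib(
    config: dict,
    context_tokens: int,
    concurrent_requests: int = 1,
    bytes_per_value: int = 2,  # assumption: FP16/BF16 KV cache
) -> float:
    """Estimate KV cache size in GiB from config.json-style fields."""
    layers = config["num_hidden_layers"]
    # Without GQA the field may be absent; fall back to one KV head per attention head.
    kv_heads = config.get("num_key_value_heads", config["num_attention_heads"])
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    # Factor 2: one key tensor and one value tensor per layer.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return bytes_per_token * context_tokens * concurrent_requests / 1024**3


# Illustrative 70B-class shapes (assumed values).
gqa_70b = {
    "num_hidden_layers": 80,
    "num_attention_heads": 64,
    "num_key_value_heads": 8,
    "hidden_size": 8192,
}
# Same shapes without GQA, to show the effect of num_key_value_heads.
mha_70b = {k: v for k, v in gqa_70b.items() if k != "num_key_value_heads"}

print(f"GQA, 32K context:  {kv_cache_gib(gqa_70b, 32 * 1024):.1f} GiB")
print(f"GQA, 128K context: {kv_cache_gib(gqa_70b, 128 * 1024):.1f} GiB")
print(f"MHA, 32K context:  {kv_cache_gib(mha_70b, 32 * 1024):.1f} GiB")
```

With these assumed shapes the GQA estimates land at 10 GiB for 32K and 40 GiB for 128K, in line with the indicative figures listed above, while the same model without GQA would need eight times as much.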
Practical implication: if you plan on long contexts or a larger number of concurrent users, KV cache constrains you just as much as weights do. It is not a technical detail — it is the primary cause of OOM errors during deployment.
What fits on 24 GB (RTX 4090, L4)
24 GB is the most common tier for on-prem development and smaller production deployments.
What fits comfortably:
- 7B FP16: weights ~16–18 GB; the remaining headroom for KV cache (~6 GB) is sufficient for moderate contexts (8–16K) with low concurrency
- 7B Q8_0: weights ~8–9 GB, plenty of KV cache headroom even at 32K context
- 13B Q4_K_M: weights ~8 GB, generous KV cache space at 8K context
- 13B Q8_0: weights ~14 GB, tighter, but fits at shorter contexts
What does not fit:
- 13B FP16: weights ~26 GB, exceeds capacity
- 34B in any common format: even at Q4 the weights (~17–20 GB) plus KV cache won't fit within 24 GB under a real workload
For a 24 GB card, Q4_K_M is the practical standard for 13B models; for 7B models you have the freedom to choose Q8 or FP16 depending on how much context you need.
What fits on 48 GB (RTX 6000 Ada, A40, L40S)
48 GB opens meaningful room for larger models.
What works well:
- 13B FP16: fits comfortably, remaining KV cache headroom is sufficient at 16–32K context
- 34B Q4_K_M: weights ~17–20 GB, ample room for a production KV cache
- 34B Q8_0: weights ~30–34 GB, tight, but workable at shorter contexts
- 70B Q4_K_M: weights ~38–40 GB; the remaining KV cache headroom (~6–10 GB) limits you to short contexts (4–8K) or 1 concurrent request
What is not ideal:
- 70B FP16: 140 GB, three times the capacity
- 70B Q8_0: ~70–75 GB, still exceeds capacity
A 48 GB card can serve a 70B model in Q4_K_M, but with a constrained context window. For most B2B use cases — RAG over documents, classification, structured extraction — a shorter context (up to 8K) is sufficient.
What fits on 80 GB (A100, H100, H200)
80 GB is the tier where most production 70B deployments run without compromise.
- 70B FP16: weights ~140 GB, still doesn't fit on a single card. You need at least two.
- 70B Q8_0: weights ~70–75 GB, fits, but leaves only ~5–10 GB for KV cache, which limits you to very short contexts or a single request
- 70B Q4_K_M: weights ~38–40 GB, leaving ~38–40 GB for KV cache, comfortable for 32K context with 2–4 concurrent requests
- 34B FP16: weights ~68 GB (34 × 16 / 8), fits, leaving roughly 10 GB for KV cache and overhead
On an H100 80 GB running 70B Q4_K_M with vLLM or SGLang you get production serving with throughput suitable for dozens of concurrent users.
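The headroom figures in the tiers above can be converted into a token budget: whatever VRAM remains after the weights is divided by the KV cache cost per token. The sketch below is a rough estimate under stated assumptions. The helper `max_cached_tokens` is hypothetical, the per-token cost assumes a 70B-class model with GQA (80 layers, 8 KV heads, head dimension 128, FP16 cache), and GB is treated as GiB for simplicity.

```python
GIB = 1024**3

# Assumed 70B-class GQA shapes: 2 (K and V) x layers x KV heads x head dim x bytes.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def max_cached_tokens(
    card_gb: float,
    weights_gb: float,
    kv_bytes_per_token: int,
    reserve_gb: float = 0.0,  # optional framework overhead to hold back
) -> int:
    """Total tokens (summed over all concurrent requests) that fit in the KV cache."""
    headroom_bytes = (card_gb - weights_gb - reserve_gb) * GIB
    return max(0, int(headroom_bytes // kv_bytes_per_token))


for card_gb in (48, 80):
    tokens = max_cached_tokens(card_gb, weights_gb=40, kv_bytes_per_token=KV_BYTES_PER_TOKEN)
    print(f"{card_gb} GB card, 70B Q4_K_M: ~{tokens} tokens, "
          f"about {tokens // (32 * 1024)} request(s) at 32K")
```

Under these assumptions the 80 GB card holds about 131,000 cached tokens (four 32K requests before any overhead is reserved), while the 48 GB card holds about 26,000 tokens in total, which is why it is limited to short contexts.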
Quantization: where to save without losing quality
Quantization reduces weight precision (from FP16/BF16 to INT8, INT4, etc.) in exchange for a smaller VRAM footprint and faster inference. The question is not "whether to quantize" — it is where quality is lost and where it isn't.
Indicative quality retention relative to FP16:
- Q8_0 (GGUF): ~98–99% — virtually indistinguishable. The standard choice when you have enough VRAM.
- Q4_K_M (GGUF): ~92–95% — the sweet spot. Most B2B use cases (RAG, classification, extraction, document reading) will not notice the difference.
- AWQ 4-bit: ~93–96%; slightly better for text coherence and code. Requires NVIDIA GPU, integrates cleanly with vLLM.
- GPTQ 4-bit: ~90–93%; maximum throughput on the NVIDIA stack, slightly lower quality than AWQ.
- Q2 (GGUF): significant degradation — noticeable on complex reasoning, long-form generation, and multilingual text.
The perplexity difference between Q4 and BF16 is below 6% across benchmarks. For most industrial use cases this is negligible. Quality loss becomes apparent when the model needs precise multi-step reasoning or generates long coherent texts — there Q4 can occasionally lose the thread compared to Q8.
For a detailed look at quantization formats, their differences, and use cases, see the GGUF, AWQ, GPTQ quantization overview.
Multi-GPU: when and how
When a model won't fit on a single card you have two options: quantize or add a GPU. Sometimes you need both.
Tensor parallelism: each layer's weight matrices (and attention heads) are sharded across multiple GPUs, so every GPU works on every layer at the same time. vLLM and SGLang handle this natively. With two A100 80 GB cards you get an effective 160 GB of VRAM and can serve 70B FP16.
Pipeline parallelism: consecutive blocks of layers are placed on different GPUs and run one after another. Less efficient than tensor parallelism (GPUs sit idle while waiting for the previous stage), but works even on cards without NVLink.
Practical recommendations:
- 2× RTX 4090 (2× 24 GB = 48 GB): 34B Q4_K_M comfortably; 70B Q4_K_M fits, but tightly, with a constrained KV cache
- 2× A100 80 GB (2× 80 GB = 160 GB): 70B FP16 with no quantization loss (about 20 GB left for KV cache and overhead), or 70B Q8_0 with generous KV cache
- NVLink between cards significantly reduces communication overhead with tensor parallelism. For production deployments prefer cards with NVLink support (A100, H100); the RTX 6000 Ada, like the RTX 4090, has no NVLink connector.
Consumer GPUs such as the RTX 4090 do not have NVLink; they communicate over PCIe, which increases latency with multi-GPU splitting. For development purposes this is fine; for production with low-latency requirements the investment in NVLink-capable data-center GPUs pays off.
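The multi-GPU configurations above can be sanity-checked with a per-card view. The sketch below assumes an idealised tensor-parallel split in which the weights divide evenly across GPUs; real deployments add per-GPU buffers and communication workspace that this ignores. The function name `per_gpu_headroom_gb` is hypothetical.

```python
def per_gpu_headroom_gb(weights_gb: float, num_gpus: int, gpu_gb: float) -> float:
    """VRAM left on each GPU for KV cache and overhead, assuming an even weight split."""
    return gpu_gb - weights_gb / num_gpus


scenarios = [
    ("70B Q4_K_M on 2x RTX 4090", 40, 2, 24),
    ("70B Q8_0 on 2x A100 80 GB", 75, 2, 80),
    ("70B FP16 on 2x A100 80 GB", 140, 2, 80),
]
for label, weights_gb, num_gpus, gpu_gb in scenarios:
    headroom = per_gpu_headroom_gb(weights_gb, num_gpus, gpu_gb)
    print(f"{label}: {headroom:.1f} GB free per GPU")
```

A negative result means the model does not fit even when split; a small positive one (as with 70B Q4_K_M on two 24 GB cards) means it loads but leaves little room for context.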
Serving framework overhead
On top of weights and KV cache you add the overhead of the serving solution itself. vLLM uses PagedAttention — it manages KV cache in pages the way an OS manages memory, reducing fragmentation from the typical 60–80% waste to under 4%. Even so, reserve extra headroom:
- `vLLM` overhead: typically 1–3 GB extra for activation buffers, prefetching, and scheduling
- `SGLang` overhead: comparable to vLLM, plus a RadixAttention tree for prefix caching
Rule of thumb: budget ~10–15% on top of your estimated weights + KV cache. For a 24 GB card that means targeting ~20–22 GB effective utilization, not 24 GB.
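The rule of thumb translates into a quick fit check. The sketch below is illustrative: the helper `fits_on_card` is hypothetical, the default 15% margin is the upper end of the range above, and the weight and KV cache inputs are the indicative figures from this article rather than measurements.

```python
def fits_on_card(
    card_gb: float,
    weights_gb: float,
    kv_cache_gb: float,
    overhead_fraction: float = 0.15,  # upper end of the 10-15% rule of thumb
) -> tuple[bool, float]:
    """Return (fits, required_gb) after adding the overhead margin."""
    required_gb = (weights_gb + kv_cache_gb) * (1 + overhead_fraction)
    return required_gb <= card_gb, required_gb


checks = [
    ("13B Q4_K_M + 4 GB KV on 24 GB", 24, 8, 4),
    ("34B Q4_K_M + 4 GB KV on 24 GB", 24, 18, 4),
    ("70B Q4_K_M + 20 GB KV on 80 GB", 80, 40, 20),
]
for label, card_gb, weights_gb, kv_gb in checks:
    ok, required = fits_on_card(card_gb, weights_gb, kv_gb)
    print(f"{label}: needs ~{required:.1f} GB -> {'fits' if ok else 'does not fit'}")
```

Treat a result close to the card's capacity as a warning rather than a pass; the margin is an estimate, and actual overhead varies with framework version and configuration.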
Unlike production frameworks, Ollama uses llama.cpp under the hood — it is excellent for developer desktops and single-user experimentation, but is not designed for concurrent requests. For 8 parallel users vLLM is substantially faster (roughly 2–3×). For a comparison of serving solutions see vLLM vs SGLang vs Ollama.
Practical reference: what goes where
A summary for common scenarios:
Developer workstation, single user:
- 7B–13B models, short context → 1× RTX 4090 (24 GB) with Q4_K_M or Q8_0
- 34B model → 2× RTX 4090 or 1× RTX 6000 Ada (48 GB) with Q4_K_M
Production server, 5–20 concurrent users:
- 7B FP16 or Q8_0 → 1× A40 or L40S (48 GB)
- 13B–34B Q4_K_M → 1× A40 or L40S
- 70B Q4_K_M with short context → 1× A100 80 GB or H100 80 GB
- 70B Q4_K_M with long context, higher throughput → 2× A100 or 2× H100
On-prem enterprise, regulated industry:
- Quality without compromise → 70B Q8_0 or FP16 → 2× H100 80 GB (NVLink)
If on-prem makes sense for your use case from a GDPR and cost perspective, see also on-prem LLM for regulated industries.
For each of these decisions: compute capacity is only one side of the equation. Equally important is what you want the model to do — and what the model could do if properly fine-tuned on your data. When to install a larger GPU versus fine-tuning a smaller model is covered in small fine-tuned vs large base model.
Frequently asked questions
Will a 70B model fit on a single RTX 4090?
Not meaningfully. The RTX 4090 has 24 GB of VRAM. A 70B model's weights in Q4_K_M take up around 38–40 GB, roughly 1.6 times the card's capacity. To run inference on a 70B model you need either two cards (2× 24 GB with tensor parallelism over PCIe) or a single 48 GB card; in both cases it fits only with a constrained context window.
What is the difference between GPU VRAM and system RAM for inference?
The model must be loaded into GPU VRAM — system RAM cannot substitute for it during GPU inference. CPU inference (via llama.cpp without a GPU) runs from system RAM, but is orders of magnitude slower. Some solutions (e.g. llama.cpp with partial offloading) load some layers into VRAM and keep the rest in RAM — practical for development experiments, not for production.
How much VRAM does a long context add?
It depends on the model. As a rough guide: for a 7B model, every 8K tokens of context adds ~1–2 GB of KV cache. For a 70B model with Grouped Query Attention (GQA) it is ~2–3 GB per 8K tokens, consistent with the ~8–12 GB at 32K quoted earlier; older models without GQA need several times more. Before purchasing hardware, verify the num_key_value_heads parameter in the target model's configuration file.
Is Q4_K_M quantization sufficient for corporate documents and RAG?
In most cases, yes. For RAG over corporate documentation (information extraction, categorization, summarization), the difference between Q4_K_M and FP16 is hard to measure in practice. Degradation occurs with complex multi-step reasoning or when generating long coherent texts. If in doubt, run your specific use case with Q4_K_M and compare the outputs against Q8_0; for typical document workloads they are usually hard to tell apart.
When to go multi-GPU instead of a single larger card?
A single larger card is generally the better choice when one exists (lower communication overhead, simpler management). Multi-GPU makes sense when: (1) the model physically won't fit on a single card even with aggressive quantization, (2) you need redundancy for high availability, or (3) you plan to serve a large number of concurrent requests and throughput is the primary metric.
*Choosing the right GPU for local inference looks like a technical question at first glance — in reality it is an architectural decision that affects cost, context window, concurrent user capacity, and system availability. At MP Industrial Solutions we help companies go from a target use case through model selection to a concrete hardware recommendation — including a TCO calculation against cloud APIs. If you are preparing for your first on-prem deployment or reconsidering an existing server, we are happy to look at your numbers.*
