The large language model market has shifted so radically in the past year that the old approach of picking one frontier model and letting it do everything has stopped working. Today you have open-weight models with million-token context windows, cloud APIs priced close to zero, local deployment on a single server, and MoE architectures with hundreds of billions of parameters that activate only a fraction of them per token. At the same time, choosing a model without a framework is a lottery, not because of model quality, but because most decisions are made without a clear brief.
This article provides a concrete decision framework. It covers four dimensions (task, infrastructure, cost, and privacy), each with a series of filters that narrow the candidate list to two or three contenders. All numbers come from verified sources; where the data is ambiguous, we say so directly.
Step 1 — Define what the model does (and what it does not)
Before selecting a model you need to know what type of task it will handle. Language models are not equally strong in all areas, and a model that tops arithmetic benchmarks can fall short on long documents.
Three core task categories:
- Extraction and classification: Pulling data from scans, labelling tickets, summarisation. Smaller models are sufficient. Latency and throughput matter more than raw intelligence.
- Generation and reasoning: Writing reports, contract analysis, coding, planning. Benchmark quality matters here — prefer frontier or strong open-weight models from the Llama, Qwen, or Mistral families.
- Long context: Analysing extensive documentation, corporate archives, meeting-minute summarisation. Models diverge dramatically here: not all of them retrieve information from the middle of a very long input equally well, even when the context window nominally allows it.
Once you know the task type, you know which benchmarks to look at: MMLU, HumanEval, and GSM8K for general reasoning and code; IFEval for instruction following; RULER or needle-in-a-haystack tests for long context. Read benchmarks carefully, though — they measure specific conditions, not production reality. More on this in How to Read LLM Benchmarks.
Step 2 — Open-weight vs cloud API: this is the real axis of the decision
Not "which model", but "where does it run". This decision determines 80 % of the remaining parameters.
Cloud API (Anthropic, OpenAI, Google, Mistral, DeepSeek)
Advantages:
- Zero infrastructure overhead: you pay for tokens, not for GPUs
- Highest performance across all categories (frontier models lead benchmarks)
- Context windows unconstrained by your own VRAM
- SLA and availability managed by the provider
Limitations:
- Your data and prompts leave your infrastructure
- Prices are variable; at high volumes monthly costs can reach five figures
- Regulated industries (healthcare, legal, finance) face strict data egress constraints
Reference prices in 2026: frontier models (Claude Opus, GPT-5.x) run on the order of $3–25 per million input tokens depending on tier. DeepSeek and similar Chinese-family models via API are typically 10–30× cheaper than US frontier. Prices have fallen significantly over the past year, so older calculations no longer apply.
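To see how token pricing turns into a monthly bill, a rough estimator helps. The sketch below is illustrative only: the workload (20,000 requests per day, 3,000 input and 500 output tokens each) and all four prices are hypothetical placeholders, not quotes from any provider.

```python
def monthly_api_cost(requests_per_day, input_tokens, output_tokens,
                     price_in_per_m, price_out_per_m, days=30):
    """Estimated monthly spend in USD for an API billed per million tokens."""
    per_request = (input_tokens * price_in_per_m
                   + output_tokens * price_out_per_m) / 1_000_000
    return requests_per_day * days * per_request


# Hypothetical workload and placeholder prices (USD per million tokens).
frontier = monthly_api_cost(20_000, 3_000, 500,
                            price_in_per_m=5.0, price_out_per_m=25.0)
budget = monthly_api_cost(20_000, 3_000, 500,
                          price_in_per_m=0.25, price_out_per_m=1.25)

print(f"frontier tier: ${frontier:,.0f} per month")
print(f"budget tier:   ${budget:,.0f} per month")
```

With these placeholder numbers the frontier tier lands in five figures per month while the budget tier stays in three, which is why the same workload should be re-costed whenever prices change.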
On-prem / local deployment (open-weight models)
Advantages:
- Data never leaves the network, which makes this the only viable path for GDPR-sensitive or classified workloads
- Predictable costs (hardware + energy) after the initial investment
- Full control over the model, prompt logs, and versions
Limitations:
- One-time GPU investment and IT overhead
- Weaker performance than frontier cloud models (the gap is narrowing but remains)
- You need a serving layer: vLLM or SGLang for production; Ollama only for single-developer use (see below)
If you want a systematic treatment of this decision, see the deeper analysis in Local LLM vs Cloud. For regulated industries additional conditions apply: eliminating data egress by running on-prem is not sufficient for compliance on its own. You also need audit logs and managed access, which is covered in On-prem LLM for Regulated Industries.
Step 3 — Model size: bigger is not always better
The open-weight market in 2026 is full of MoE (Mixture of Experts) architectures. What this means in practice: a model labelled "400B parameters" may activate only ~17 billion during a single inference request. Parameter count and active parameters are two different numbers.
Practical implications for selection:
- MoE models (e.g. Llama 4 Maverick, Qwen 3.x MoE variants, Mixtral, DeepSeek V3): Lower compute at inference time, but you must load the full model to disk and VRAM. Large MoE models have hundreds of billions of parameters, only a fraction of which are active per token — yet VRAM must hold the entire model. A naïve focus on "activated parameters" therefore underestimates hardware requirements.
- Dense models (Gemma 3, Phi-4, older Llama 3.x): Simpler deployment; parameter count ≈ compute. Phi-4 and smaller Gemma 3 models are excellent for edge deployments and embedded use cases.
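The difference between total and active parameters can be made concrete with a small sketch. It assumes 16-bit weights (2 bytes per parameter) and the common rule of thumb of roughly 2 FLOPs per active parameter per generated token; the two model specs are generic illustrations, not measurements of a specific release.

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    name: str
    total_params_b: float   # billions of parameters that must be stored
    active_params_b: float  # billions used per token (equals total for dense)

    def weight_memory_gb(self, bytes_per_param: float = 2.0) -> float:
        # Memory follows TOTAL parameters: every expert must be resident.
        return self.total_params_b * bytes_per_param

    def gflops_per_token(self) -> float:
        # Rule of thumb (assumption): ~2 FLOPs per active parameter per token.
        return 2.0 * self.active_params_b


moe = ModelSpec("moe-400b-a17b", total_params_b=400, active_params_b=17)
dense = ModelSpec("dense-70b", total_params_b=70, active_params_b=70)

for m in (moe, dense):
    print(f"{m.name}: {m.weight_memory_gb():.0f} GB weights, "
          f"{m.gflops_per_token():.0f} GFLOPs per token")
```

Under these assumptions the MoE model needs several times more memory than the dense one while doing roughly a quarter of the compute per token, which is exactly the trap of sizing hardware by active parameters.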
Approximate VRAM requirements (excluding KV cache) for common sizes:
- 7–9B model: Q4_K_M format ≈ 5–7 GB VRAM; FP16 ≈ 16–19 GB
- 13B model: Q4_K_M ≈ 8 GB; FP16 ≈ 26 GB
- 70B model: Q4_K_M ≈ 35–40 GB; FP16 ≈ 140–168 GB
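These figures follow from parameter count times bits per weight. The sketch below is a planning aid under stated assumptions: 4.5 effective bits per weight for Q4_K_M is a rough figure (the format mixes block types), the 20 % VRAM reserve for KV cache and activations is arbitrary, and real files plus runtime overhead usually land somewhat higher.

```python
# Assumed effective bits per weight; 4.5 for Q4_K_M is a rough planning value.
BITS_PER_WEIGHT = {"fp16": 16.0, "q4_k_m": 4.5}


def weight_gb(params_billion: float, fmt: str) -> float:
    """Approximate weight size in GB, excluding KV cache and runtime overhead."""
    return params_billion * BITS_PER_WEIGHT[fmt] / 8.0


def fits(vram_gb: float, params_billion: float, fmt: str,
         reserve: float = 0.2) -> bool:
    """True if the weights fit while keeping `reserve` of VRAM free."""
    return weight_gb(params_billion, fmt) <= vram_gb * (1.0 - reserve)


for size in (7, 13, 70):
    sizes = {fmt: round(weight_gb(size, fmt), 1) for fmt in BITS_PER_WEIGHT}
    print(f"{size}B: {sizes}")

print("13B FP16 on a 48 GB card:", fits(48, 13, "fp16"))
print("70B Q4_K_M on a 48 GB card:", fits(48, 70, "q4_k_m"))
```

The second check fails even though the weights are nominally under 48 GB, because the reserve for KV cache is not met. That is the practical reason to treat the table as a lower bound.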
Quantisation (GGUF Q4_K_M, AWQ 4-bit) is not automatically harmful — on most benchmarks it stays within 5–8 % of FP16 quality. Significant degradation only appears at Q2 and below. More on techniques and their trade-offs in LLM Quantisation (GGUF, AWQ, GPTQ).
For most B2B use cases, a well fine-tuned 13B model can outperform a generic 70B model on a narrow domain. Before deciding on size it is worth considering whether you have enough data for fine-tuning, which is covered in RAG vs Fine-tuning.
Step 4 — Latency and throughput: who is your user?
Two very different profiles with very different requirements:
Interactive (user-facing) chat or copilot: Latency is critical. The first token should arrive within 1–2 seconds. TTFT (Time to First Token) is the relevant metric here. A smaller model that responds quickly beats a large one that makes you wait.
Batch processing: Throughput is critical. Tokens per second across the full batch is what matters. Here a larger model can be worth its higher per-request latency, because you are processing tens of thousands of documents at once.
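Both metrics can be measured from any streaming response. The sketch below is a minimal, client-side illustration: `measure_stream` and `fake_stream` are hypothetical helper names, and the fake stream stands in for whatever iterator your serving client actually returns.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(tokens: Iterable[str],
                   clock: Callable[[], float] = time.perf_counter):
    """Consume a token stream; return (ttft_seconds, tokens_per_second, n_tokens)."""
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        count += 1
    end = clock()
    if count == 0:
        return None, 0.0, 0
    total = end - start
    tps = count / total if total > 0 else float("inf")
    return first - start, tps, count


def fake_stream(n: int, first_delay: float = 0.05,
                next_delay: float = 0.005) -> Iterator[str]:
    """Stand-in for a real streaming response (assumption for illustration)."""
    for i in range(n):
        time.sleep(first_delay if i == 0 else next_delay)
        yield f"tok{i}"


ttft, tps, n = measure_stream(fake_stream(20))
print(f"TTFT {ttft:.3f} s, {tps:.0f} tokens/s over {n} tokens")
```

For interactive use you would track the TTFT value against the 1–2 second target; for batch jobs the tokens-per-second figure, aggregated over concurrent requests, is the number to compare.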
For serving infrastructure: vLLM is the production choice for most NVIDIA deployments — PagedAttention dramatically reduces KV cache fragmentation (from the typical 60–80 % waste to under 4 %) and continuous batching raises throughput 2–3× compared to static batching. SGLang is stronger for prefix-heavy workloads (RAG, agents, multi-turn) — benchmarks show ~29 % higher throughput on H100 and ~23 % faster TTFT versus vLLM.
Ollama is suitable for a single developer on a desktop, not for production multi-user deployments. With multiple concurrent users its throughput is significantly lower than vLLM.
Step 5 — Cost: where you actually pay
The cloud LLM API market is considerably more favourable on pricing than it was a year ago. But traps remain.
Context window ≠ cheaper solution. A 1M-token context does not mean you always send a million tokens, but you do pay for every token you send. On local deployments the KV cache also grows linearly with sequence length. For example, a 70B model at 128K context requires ~40 GB of KV cache alone (assuming grouped-query attention and a 16-bit cache); for four parallel requests at 128K that is ~160 GB on top of the model itself. The context window is capacity, not a constant.
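The KV cache figures can be reproduced with a short calculation. The architecture values below (80 layers, 8 key/value heads, head dimension 128, 2 bytes per value) are an assumed configuration typical of a 70B-class model with grouped-query attention, not numbers taken from a specific model card.

```python
GIB = 1024 ** 3


def kv_cache_bytes(seq_len: int, n_layers: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2, batch: int = 1) -> int:
    """Size of the KV cache in bytes for `batch` sequences of `seq_len` tokens."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch


# Assumed 70B-class configuration with grouped-query attention.
cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128)
ctx = 128 * 1024

one = kv_cache_bytes(ctx, **cfg) / GIB
four = kv_cache_bytes(ctx, batch=4, **cfg) / GIB
print(f"1 request at 128K:  {one:.0f} GiB")
print(f"4 requests at 128K: {four:.0f} GiB")
```

Because the cost is linear in both sequence length and batch size, halving the context you actually use halves this memory, regardless of what the model's nominal window is.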
Prompt caching is an important tool for reducing costs on repeated system prompts. As a rough guide: with a good workload you can save 50–70 % on input token costs. But cache write tokens cost 1.25–2× more than regular tokens on some platforms — the saving only materialises once you read the same prefix repeatedly. Workloads with unique long prompts gain nothing from caching. More in Prompt Caching and Cost.
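The break-even point can be estimated with a few lines. This sketch considers only the cached prefix and assumes the first request writes the cache and every later request reads it before it expires. The write multiplier of 1.25 comes from the range above; the read multiplier of 0.1 is an assumption for illustration, so check your provider's price list.

```python
def caching_saving(n_requests: int, write_mult: float = 1.25,
                   read_mult: float = 0.1) -> float:
    """Fraction of prefix input cost saved versus no caching.

    A negative result means caching costs more than it saves.
    """
    without_cache = float(n_requests)
    with_cache = write_mult + (n_requests - 1) * read_mult
    return 1.0 - with_cache / without_cache


for n in (1, 2, 5, 10, 100):
    print(f"{n:>3} requests sharing a prefix: {caching_saving(n):+.0%}")
```

With a single request the result is negative, which is the "unique long prompts gain nothing" case. Total savings on a real bill are lower than these prefix-only figures, because the uncached part of each prompt is still billed at the regular rate.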
Routing (sending simple questions to a cheap model, complex ones to an expensive one) can preserve 95 % of quality at a fraction of the cost when well calibrated. Research from Berkeley showed that with a good router, 75–90 % of calls go to the smaller model. This is easy to implement but requires baseline evals — without measurement you do not know where the cut-off is.
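A threshold router can be sketched in a few lines. Everything here is hypothetical: the model names are placeholders, and `toy_score` is a crude keyword heuristic standing in for the trained difficulty classifier a production router would use. The threshold is the cut-off that baseline evals are needed to calibrate.

```python
from typing import Callable, Iterable

HARD_HINTS = ("why", "compare", "prove", "analyse", "plan")


def toy_score(prompt: str) -> float:
    """Placeholder difficulty score in [0, 1.5]; not a real classifier."""
    text = prompt.lower()
    length_part = min(1.0, len(text.split()) / 200)
    hint_part = 0.5 if any(h in text for h in HARD_HINTS) else 0.0
    return length_part + hint_part


def make_router(score: Callable[[str], float], threshold: float,
                small: str = "small-model", large: str = "large-model"):
    """Return a function that maps a prompt to a model name."""
    def route(prompt: str) -> str:
        return large if score(prompt) >= threshold else small
    return route


def share_to_small(prompts: Iterable[str], route, small: str = "small-model") -> float:
    """Fraction of prompts the router sends to the small model."""
    prompts = list(prompts)
    return sum(route(p) == small for p in prompts) / len(prompts)


route = make_router(toy_score, threshold=0.5)
sample = [
    "What is the invoice number?",
    "Label this ticket: printer offline",
    "Compare these two contracts and explain the conflict in clause 7",
    "Extract the delivery date",
]
print({p: route(p) for p in sample})
print("share to small model:", share_to_small(sample, route))
```

Moving the threshold trades cost against quality, so the useful experiment is to sweep it and re-run your eval set at each value.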
Step 6 — Licences and terms of use
This gets overlooked until it becomes a problem.
Open-weight models are not automatically free for any use:
- Llama 4 (Meta): Meta custom licence. Restrictions apply for deployments exceeding 700 million monthly active users. For most B2B enterprise deployments this limit is not relevant, but you need to read it.
- Qwen 3.x: Apache 2.0, which allows commercial use, modification, and distribution without fees.
- Mistral: smaller models (e.g. Mistral Small) are Apache 2.0; larger ones (Mistral Large) carry a proprietary Mistral licence. Verify for the specific model you intend to use.
- DeepSeek V3: MIT licence — maximum freedom including fine-tuning and redistribution.
- Gemma 3 (Google): proprietary Gemma licence — commercial use is permitted, but it is not an OSI-approved open-source licence. Read the terms carefully.
- Phi-4 (Microsoft): MIT.
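If you track candidates in code, the licence check becomes a simple filter. The table below only restates the list above as data; the labels are shorthand chosen for this sketch, and licences change between releases, so confirm each entry against the current model card.

```python
# Licence labels restate the list above; verify against each model card.
LICENCES = {
    "Llama 4": "Meta custom",
    "Qwen 3.x": "Apache-2.0",
    "Mistral Small": "Apache-2.0",
    "Mistral Large": "Mistral proprietary",
    "DeepSeek V3": "MIT",
    "Gemma 3": "Gemma",
    "Phi-4": "MIT",
}
PERMISSIVE = {"Apache-2.0", "MIT"}


def permissive_models(licences: dict) -> list:
    """Models whose licence is in the permissive set."""
    return sorted(m for m, lic in licences.items() if lic in PERMISSIVE)


def needs_legal_review(licences: dict) -> list:
    """Models with custom or proprietary terms that someone has to read."""
    return sorted(m for m, lic in licences.items() if lic not in PERMISSIVE)


print("permissive:", permissive_models(LICENCES))
print("read the terms:", needs_legal_review(LICENCES))
```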
For closed-weight cloud APIs (Claude, GPT-5.x, Gemini), terms are governed by SLA and terms of service — pay attention to data retention policy and the opt-out from training data use.
Regulated industries should have a DPA (Data Processing Agreement) signed before the first production call.
Step 7 — Context window: when 1M tokens helps and when it does not
Almost every flagship model in 2026 offers a context window of at least 128K tokens. Llama 4 Scout goes up to 10M. Claude (higher tiers), Gemini 2.5, and Llama 4 Maverick offer 1M; DeepSeek V3 has 128K.
The question is not "which has the bigger context" but "do I actually need it?".
Research shows that as context grows, models exhibit "context rot": retrieval accuracy degrades when relevant content is surrounded by large amounts of irrelevant text. This is especially true for multi-hop questions that require combining information from different parts of a document.
Practical rule: if your use case involves long documents (contracts, technical manuals, archives) but queries are targeted in nature, RAG will be more economical and more accurate than feeding the entire document into context. Long context makes sense where you genuinely need the model to read the whole document at once — generating an abstract from a 200-page report, analysing a code base.
Practical decision tree
This process will narrow the field to two or three candidates in practice:
1. Can data leave your network? → No: open-weight + local serving. Yes: continue.
2. Is throughput or latency critical and volume high? → Yes: consider local serving. No: cloud API.
3. What is the task? → Simple extraction/classification: smaller model (7–13B or cheap API). Complex reasoning: frontier or strong 70B+.
4. Do you have a specific domain with sufficient data? → Consider fine-tuning a smaller model before purchasing a larger one.
5. What is the licence? → Filter for Apache 2.0 / MIT for production commercial deployments with no legal overhead.
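The same tree can be written down as a function, which is handy when the brief is captured in a form or a ticket. This is a direct, simplified transcription of the five questions; the `Brief` fields and the output strings are names chosen for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Brief:
    data_may_leave_network: bool
    high_volume_and_latency_critical: bool
    complex_reasoning: bool
    has_domain_data: bool


def shortlist_criteria(brief: Brief) -> list:
    """Translate a brief into the filters used to pick two or three candidates."""
    criteria = []
    # Questions 1 and 2: where the model runs.
    if not brief.data_may_leave_network:
        criteria.append("open-weight + local serving")
    elif brief.high_volume_and_latency_critical:
        criteria.append("consider local serving")
    else:
        criteria.append("cloud API")
    # Question 3: task complexity sets the size class.
    if brief.complex_reasoning:
        criteria.append("frontier or strong 70B+")
    else:
        criteria.append("smaller model (7-13B or cheap API)")
    # Question 4: domain data makes fine-tuning an option.
    if brief.has_domain_data:
        criteria.append("consider fine-tuning a smaller model first")
    # Question 5: licence filter.
    criteria.append("filter for Apache 2.0 / MIT")
    return criteria


print(shortlist_criteria(Brief(False, False, False, True)))
print(shortlist_criteria(Brief(True, False, True, False)))
```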
Frequently asked questions
Which open-weight model is the best today?
There is no single correct answer. In 2026 the models leading various benchmarks include Llama 4 Maverick, Qwen 3.x, DeepSeek, and Mistral Large — it depends on the task. Qwen-family models are strong for code and reasoning; Llama 4 Scout excels at long context (10M context window). Always test on your own data, not just public benchmarks.
Is DeepSeek reliable for European deployments?
DeepSeek offers open weights under an MIT licence — you can download the model and run it locally with no calls to Chinese servers whatsoever. From a GDPR perspective, a local DeepSeek deployment is just as "clean" as Llama or Mistral. The cloud API version through DeepSeek servers is a different question — the same data egress considerations apply there as with US providers.
What is MoE and do I need to care about it when choosing?
MoE (Mixture of Experts) is an architecture where the model activates only a subset of parameters for each token. The practical consequence: lower compute at inference time, but a VRAM footprint set by the total parameter count. If you are deploying locally, you must load the full model into memory even though only a fraction is used per token. For cloud APIs this detail is largely irrelevant: you pay per token, and the provider handles the memory.
Is fine-tuning worth it instead of buying a larger model?
In many cases yes — but only if you have enough high-quality data and a clearly defined domain. A well fine-tuned 13B model can outperform a generic 70B on a narrow industrial task. If you do not have sufficient data (SFT requires on the order of thousands of quality examples), fine-tuning is more likely to hurt than help. We cover the RAG vs fine-tuning decision in RAG vs Fine-tuning.
How do I know I chose correctly?
The right choice is validated by evaluations on your own data and use cases — not just by comparing benchmarks. Define 50–100 test cases with expected outputs, run them on the candidates, compare. We describe this process in detail in How to Measure LLM Application Quality.
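A first version of such an evaluation needs very little code. The sketch below is a minimal harness under simplifying assumptions: the two candidates are stub functions standing in for real model calls, the two test cases are invented, and the default check is a case-insensitive substring match, which is too crude for free-form generation.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (prompt, expected substring)


def contains(output: str, expected: str) -> bool:
    """Default check: expected text appears in the output, ignoring case."""
    return expected.lower() in output.lower()


def pass_rate(model: Callable[[str], str], cases: List[Case],
              check: Callable[[str, str], bool] = contains) -> float:
    """Share of test cases for which the model output passes the check."""
    return sum(check(model(p), e) for p, e in cases) / len(cases)


def compare(candidates: Dict[str, Callable[[str], str]], cases: List[Case]):
    """Return (pass_rate, name) pairs, best candidate first."""
    return sorted(((pass_rate(fn, cases), name)
                   for name, fn in candidates.items()), reverse=True)


# Stub candidates; replace with calls to the models you are evaluating.
def stub_a(prompt: str) -> str:
    return "120 EUR" if "Total" in prompt else "category: hardware"


def stub_b(prompt: str) -> str:
    return "I am not sure."


cases = [
    ("Extract the total from: 'Total due 120 EUR'", "120 EUR"),
    ("Classify this ticket: 'My printer will not turn on'", "hardware"),
]
print(compare({"stub_a": stub_a, "stub_b": stub_b}, cases))
```

In practice you would keep the same test cases fixed across all candidates and rerun them whenever a model version, prompt, or quantisation level changes.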
*At MP Industrial Solutions we help companies navigate model selection in a structured way — from mapping use cases through candidate testing to production deployment on their own infrastructure. If you are working through this decision and want to avoid costly dead ends, we would be happy to talk.*
