A company deploys a local model via Ollama, builds an internal chatbot, the team loves it. Three months later more users arrive, the queue grows, and response times stretch from two seconds to fifteen. Someone says: "let's buy a more powerful GPU." They do. Response times drop to eight seconds. The problem is not the hardware — the problem is a serving stack that was never designed for production workloads.
This is a scenario we see repeatedly. Not because teams are doing anything foolish — Ollama is an excellent tool for the purpose it was designed for. The problem arises when a development tool reaches production without anyone asking: "what do we actually need from a serving stack?" This article offers decision criteria instead of a blanket recommendation — because the right choice depends on workload, team, and infrastructure.
Why the serving stack matters more than it appears
A serving framework is not just a "wrapper around a model." It is an orchestrator that decides how concurrent requests are batched, how KV cache is managed, and how memory is allocated across multiple simultaneous requests.
Classic static batching waits until a batch is full, then sends all requests at once — the model runs, results are returned, and it waits for the next batch. That is simple, but efficiency drops when short and long requests are mixed together. Continuous batching (implemented in vLLM, SGLang, and TGI) solves this differently: every generated token is an opportunity to add a newly arrived request to the batch. The result is 2–3× higher throughput without changing hardware.
Another critical dimension is KV cache management — the intermediate attention results stored for every token in the context. A naïve implementation reserves memory for the maximum possible context length upfront, even when most requests are much shorter. PagedAttention (vLLM) solves this by paging the KV cache similarly to how an OS pages RAM — no upfront reservation, dynamic allocation instead. The result is a dramatic reduction in fragmentation: from the typical 60–80 % waste down to under 4 %.
These differences multiply in a production environment with dozens of concurrent requests far beyond what simple single-request benchmarks reveal.
vLLM: throughput as the primary priority
vLLM is the de facto standard for production LLM serving when maximum throughput on NVIDIA GPUs is the primary goal. It originated at UC Berkeley, has the broadest integration ecosystem, and is actively developed.
Key technical features:
- PagedAttention — KV cache management via virtual pages, dramatically reduces memory fragmentation and enables higher parallelism
- Continuous batching — dynamically adds requests to the current batch without waiting
- OpenAI-compatible API — migrating from a cloud API to self-hosted is usually just a matter of changing the URL and API key
- Support for NVIDIA, AMD (ROCm), Google TPU, Intel Gaudi
- Native support for
GPTQ,AWQ,FP8, andNVFP4quantisation XGrammarbackend for structured output (JSON mode) with overhead under 40 µs per token
When vLLM clearly wins:
When you are running an API endpoint accessed by multiple users or processes concurrently — an internal company chatbot, a higher-volume RAG backend, a production application API server. If you benchmark on single-request latency, the gap versus competitors is smaller. But at 8, 16, or 32 concurrent requests, the difference becomes pronounced.
On Blackwell GPUs with NVFP4 quantisation, benchmarks show up to ~16× higher throughput compared to Ollama on the same hardware — a difference that changes the economics of an entire project.
Limitations:
vLLM has a steeper learning curve. Configuring it for a production deployment requires understanding parameters such as --tensor-parallel-size, --max-model-len, and --gpu-memory-utilization. For a team without LLM infrastructure experience, the initial setup is non-trivial. For purely CPU or consumer hardware without NVIDIA GPUs, the ecosystem is weaker.
SGLang: complex workloads and structured output
SGLang (Structured Generation Language) emerged from a research context focused on a different class of problems: multi-turn conversations, agentic workloads with long shared prefixes, and structured output (JSON schemas, grammars).
The key innovation is RadixAttention — an LRU cache of KV values organised into a radix tree. When multiple requests share the same prefix (for example, the same system prompt or the same document context), SGLang computes that prefix once and shares it across requests. In agentic RAG scenarios, where every request starts with the same long document context, this can make an enormous difference.
Where SGLang outperforms vLLM:
On prefix-heavy workloads, benchmarks show ~29 % higher throughput on H100 and ~23 % lower TTFT (Time to First Token) — 79 ms vs 103 ms. These are not negligible numbers in interactive applications where TTFT directly affects perceived speed.
For structured decoding (JSON mode, grammars), SGLang is faster: constrained decoding runs ~3× faster than older implementations because grammar compilation proceeds more efficiently.
Typical use cases where SGLang shines:
- Multi-turn agentic workloads where each round shares a long history prefix
- Batch inference over a large document corpus with the same system prompt
- Applications with intensive JSON output (structured data extraction, classification)
- RAG pipelines where the same document is queried multiple times in one session
Limitations:
SGLang has a somewhat smaller ecosystem than vLLM and a smaller community, which shows in the speed of edge-bug fixes and the availability of documentation. For standard inference workloads without prefix optimisation, the gap versus vLLM is smaller and the choice comes down more to team preference.
Ollama: developer experience first
Ollama is a different category of tool. It is not a production serving framework — it is a developer desktop tool that does one thing excellently: it lets you run a local model in five minutes, with no configuration, on any hardware including Mac, Linux, and Windows.
Under the hood it runs llama.cpp, which is optimised for CPU inference and efficient operation with GGUF quantised models. For single-user experimentation and development, this is the perfect stack.
Where Ollama makes sense:
- Developer desktop — local experiments, prototyping, model testing
- Single-user internal tools with low load (one to two concurrent users)
- Teams without DevOps experience who need something working quickly
- Mac or Windows environments where vLLM/SGLang have no native GPU support (or limited support)
- On-device deployment on developer laptops
Why Ollama is not enough for production:
The llama.cpp running under Ollama is not designed for concurrent requests. When 8 parallel requests arrive, the queue is processed serially — without continuous batching. Benchmarks consistently show that on the same hardware, vLLM is ~2.3× faster at 8 concurrent requests. At 16 requests, the gap is even wider.
This is not an Ollama bug — it is a consequence of design decisions that prioritise simplicity and compatibility over throughput. For a developer desktop that is the right trade-off. For a production endpoint with dozens of users, it is a problem.
Decision-making by workload and team
Instead of a blanket recommendation, here is a decision matrix:
Small team (2–5 people), just starting with local LLMs, experimenting:
Start with Ollama. You will learn to work with models without infrastructure overhead. When you hit the performance limits (and you will), it will be obvious why migration is necessary.
Production API endpoint with multiple concurrent users, NVIDIA GPU:
vLLM is the default choice. Broadest ecosystem, best documentation, OpenAI-compatible API. If the team has no LLM infrastructure experience, expect setup to take days, not hours.
Agentic applications, RAG with long repeated prefixes, intensive JSON output:
Consider SGLang — RadixAttention will save GPU memory and latency on prefix-heavy workloads. For teams that have already deployed vLLM and it is working, there is no reason to migrate just for a marginal improvement. SGLang is relevant when you know your workload is prefix-heavy.
NVIDIA Blackwell (GB200/B200), maximum performance:
vLLM or TensorRT-LLM — both are optimised for NVFP4 quantisation on Blackwell GPUs. TensorRT-LLM has higher peak performance, but significantly higher setup and operational complexity.
Regulated environment, air-gapped network, no cloud dependencies: All three run fully offline. For production deployment in a regulated environment we recommend vLLM for its ecosystem and auditability. More on the specifics of regulated deployments in the article On-Prem LLM for Regulated Industries.
What a serving stack is not: separating it from quantisation and GPU sizing
When companies start working through "how to deploy an LLM," three topics regularly get conflated: serving stack, quantisation, and GPU sizing. These are separate decisions with different priority ordering.
Quantisation is a decision about the numerical precision of model weights (FP16 vs Q8 vs Q4 vs GPTQ/AWQ). It affects model size in memory and inference speed at an acceptable quality cost. Q4_K_M is within ~5–8 % of FP16 on most benchmarks — a difference that is not perceptible for most production use cases. Quantisation is orthogonal to serving stack choice: you can run a quantised model via vLLM or via Ollama. More on formats in the article LLM Quantisation (GGUF, AWQ, GPTQ).
GPU sizing is a decision about how much VRAM (and how many GPUs) you need for a given model and workload. This is a separate calculation: VRAM for model weights + KV cache for the expected number of concurrent requests × context length. A bad serving stack on the right hardware will still underperform; the right serving stack on undersized hardware will too. More on specific VRAM calculations in the article Which GPU for LLM Inference.
The practical decision order: first the model (size, capabilities), then quantisation (reduces memory requirements), then GPU sizing (how much VRAM is needed), then serving stack (how the workload is served). Many teams do this in reverse order and then wonder why optimisation is not enough.
KV cache and long context: the hidden memory burden
One number that gets underestimated when comparing serving stacks: KV cache grows linearly with context length. For a 70B model at 128K context, the KV cache alone can occupy around 40 GB — on top of the model weights. For four parallel requests with such a context that is ~160 GB additional.
Modern models with Grouped Query Attention (GQA) dramatically reduce this burden compared to classic multi-head attention — most current models (Llama 4, Qwen 3, Mistral Large) include GQA. A further optimisation is KV cache quantisation to INT8/FP8, which halves its size with minimal quality loss.
For long-context workloads — such as processing long industrial documents or multi-turn conversations in technical support — this number is critical when deciding on GPU configuration. vLLM via PagedAttention manages KV cache dynamically and more efficiently than a naïve implementation; SGLang via RadixAttention additionally shares KV cache for repeated prefixes.
Practical implication: a "1M token context window" sounds great to a customer. In practice, every request with 1M tokens demands tens of gigabytes of KV cache. For most production use cases, RAG is more cost-effective than filling an entire document into the context — even when the model technically supports long context.
Monitoring and observability in production
Whatever serving stack you choose, a production deployment without monitoring is flying blind. Three metrics to track from day one:
TTFT (Time to First Token) — how long it takes for the model to produce the first token. Directly affects perceived speed in interactive applications. For conversational UI, TTFT under 300–500 ms is the threshold where users experience the response as "instant."
Throughput (tokens/second) — global and per-request. Important for batch workloads and capacity planning.
Queue depth and queue latency — when the queue grows, it signals either a need for horizontal scaling or a review of the batching configuration. A growing queue with stable TTFT indicates the problem is capacity, not serving efficiency.
Both vLLM and SGLang export Prometheus metrics natively — a Grafana dashboard is an hour's work. For teams also managing the cost side, an interesting approach is LLM routing: sending simple requests to a smaller, cheaper model and complex ones to a larger. More on this in the article LLM Routing and Cascading.
Frequently asked questions
Can I use Ollama in production if I only have one or two concurrent users?
Yes, for truly small workloads — one team, a handful of people, low request frequency — Ollama works fine in production. The problem arises with growth. When load doubles and Ollama can no longer keep up, migrating to vLLM is not a trivial change: a different configuration model, different process management, different deployment patterns. If you expect load to grow, it is worth building the serving stack correctly from the start.
Is vLLM or SGLang better for RAG applications?
It depends on the specific RAG architecture. If every request starts with the same system prompt with little variation and short documents change with each request, vLLM and SGLang are comparable. If the architecture shares a long document context across multiple requests in one session (for example, analysing a long manual with multiple questions), RadixAttention in SGLang can deliver meaningful memory and latency savings. For more on RAG architectures see Agentic RAG.
How does vLLM differ from TensorRT-LLM?
TensorRT-LLM from NVIDIA achieves higher peak performance on NVIDIA hardware (especially Blackwell) through fused kernels and NVFP4 quantisation. The cost is significantly higher complexity: models must be compiled before deployment, the pipeline is less flexible, and setup takes longer. vLLM is the more pragmatic choice in most production scenarios — you get 80–90 % of TensorRT-LLM performance at a fraction of the operational complexity. TensorRT-LLM makes sense for extreme throughput requirements or when optimising for a specific model on specific hardware.
Do these frameworks work with quantised models?
Yes, all three support quantised models. vLLM and SGLang natively handle AWQ and GPTQ formats and have FP8/NVFP4 support for Blackwell GPUs. Ollama works primarily with the GGUF format (llama.cpp). For production deployment on NVIDIA GPUs, AWQ or GPTQ is generally preferable to GGUF because it leverages optimised CUDA kernels. For cross-platform or CPU deployment, GGUF is more practical.
How quickly can I migrate from Ollama to vLLM?
If your application communicates via an OpenAI-compatible REST API (which is the case for most Ollama deployments), migrating to vLLM is, from the application code perspective, just a change of base URL and API key. The larger effort is on the infrastructure side: deployment, monitoring, capacity configuration. For a team doing it for the first time, budget one to two days of work for a functioning production deployment.
*At MP Industrial Solutions we help companies design and deploy LLM infrastructure that matches real workloads and real teams — not just demos. If you are working out which serving stack is right for your use case, or if Ollama is starting to creak under load, we are happy to assess it together.*
