A company deploys a local 13B model and enables it for ten internal users. The first week runs smoothly. Then a second team arrives, adds another use case, the number of concurrent requests jumps to 20–30, and response time climbs from three seconds to forty. Someone proposes buying a more powerful GPU. They buy it. Response time drops to 25 seconds. The problem is not the hardware — the problem is what is happening with the KV-cache and batching under the hood.
LLM inference throughput and latency are not purely a hardware question. They are an architectural one. How the serving framework manages memory, how it batches requests, how it shares intermediate results — that determines whether you extract efficiency proportional to your GPU's price tag, or whether you pay for performance that is never actually used. This article explains the mechanisms and shows where the levers are for improvement without buying additional hardware.
Where performance disappears: the anatomy of LLM inference
To understand where optimisation is possible, you first need to understand what happens during text generation. LLM inference proceeds in two phases.
The prefill phase processes the entire input prompt at once — this is a parallelisable operation that loads the GPU like a matrix multiplication. The result is a set of key and value vectors (KV) for each input token, stored in memory. The time required for this phase is called TTFT — Time to First Token. TTFT determines when the user sees the first character of the response.
The decode phase generates tokens one at a time, with each new token depending on all preceding ones. This is a sequential operation — it cannot be parallelised within a single request. Generation speed is measured in tokens per second (tokens/sec).
The key insight: prefill is compute-bound (it loads the GPU cores), while decode is memory-bound (it loads memory bandwidth). These two phases have different bottlenecks and require different optimisations. Most naive implementations sacrifice throughput for simplicity.
Static batching: why it breaks down under mixed workloads
The classic approach to batching is straightforward: wait until a batch of requests fills up, send them all at once, the model processes them in parallel, results are returned. Then wait for the next batch.
The problem arises when requests vary in length — which is the common reality. One request wants a single-line answer; another generates 500 tokens. Static batching waits until the longest request in the batch finishes. GPU cores that could have started a new request sit idle. Effective GPU utilisation drops to 20–40% under a typical mixed workload.
On real workloads, this means the GPU you are paying for spends most of its time waiting. Not for data, not for the network — waiting for one slow request.
Continuous batching: dynamic batch refilling
Continuous batching solves this problem differently. Instead of waiting for the batch to finish, every generated token becomes an opportunity: the server checks the queue of waiting requests and immediately adds new ones to the current batch. A request that just finished frees its slot for a new one — without a pause.
The result is dramatic: on the same hardware, continuous batching typically achieves 2–3× higher throughput compared to static batching under mixed workloads. Not because the GPU is faster, but because idle cycles are filled with new requests.
vLLM, SGLang, and TGI implement continuous batching natively. Ollama does not implement it fully — it is optimised for single-user desktop use, not for production multi-user workloads. For a team with a dozen concurrent users the difference is noticeable: across eight parallel requests, vLLM is substantially more performant than Ollama on the same hardware. A detailed comparison of serving frameworks is in the article vLLM vs SGLang vs Ollama.
KV-cache: where VRAM "disappears" with long contexts
Every token the model processes — whether in the prefill or decode phase — produces a key and value vector for each attention layer. These vectors must be stored so they do not have to be recomputed when generating the next token. This storage is called the KV-cache.
The size of the KV-cache grows linearly with context length. For reference: a 70B model with a 128K context window requires tens of gigabytes for the KV-cache alone — on top of the VRAM the model weights themselves occupy. For four parallel requests at the same context length, that figure multiplies fourfold.
Modern models mostly implement Grouped Query Attention (GQA), which dramatically reduces KV-cache size compared to classic multi-head attention by having groups of query heads share common key and value vectors. Most current model families (Llama, Qwen, Mistral) include GQA — it is now the standard, not the exception.
Another lever is KV-cache quantisation: reducing the precision of KV vectors to INT8 or FP8 can halve their memory footprint with minimal quality loss. Production serving frameworks support this as an optional optimisation.
The practical consequence: if your VRAM is insufficient and you are considering long contexts, the KV-cache is the first place to look for the cause. Before buying additional GPU hardware, it is worth measuring how much VRAM the KV-cache actually occupies under your typical workload.
PagedAttention: virtual paging for the KV-cache
Naive KV-cache management creates another problem: it reserves memory for the maximum possible context length upfront, even when most requests are far shorter. Imagine allocating 128K slots because the model technically supports it, yet the average request is 2,000 tokens. Most of the allocated memory sits empty — and due to fragmentation it cannot be recycled efficiently.
PagedAttention, the key innovation in vLLM, solves this problem by drawing inspiration from operating systems. Just as an OS pages RAM into physical blocks and maps them virtually, PagedAttention divides the KV-cache into fixed-size blocks (pages) that are allocated dynamically as each request actually generates tokens. Memory is allocated on demand, not upfront.
The result: KV-cache fragmentation drops from typical 60–80% waste to under 4%. This means the same VRAM can serve many more parallel requests, or the same number of requests with a longer context.
For production deployments, this difference directly affects how many GPUs you need — and therefore serving costs.
Latency vs. throughput: they are not the same thing
One of the most common conceptual confusions: optimising for throughput and optimising for latency are partially conflicting goals.
Throughput measures how many tokens (or requests) the server generates per second in aggregate — relevant for batch workloads, offline document processing, high-volume APIs.
Latency measures how quickly a specific user receives their response — TTFT and total time to end of generation. Relevant for interactive applications, chatbots, and copilots.
The problem arises when continuous batching is applied aggressively: the server may intentionally delay sending a response in order to fill a larger batch — higher throughput, worse latency for the individual user. Most production frameworks expose parameters (--max-num-seqs, scheduler-delay-factor) to tune this trade-off according to your priorities.
For interactive applications, the correct strategy is usually to prioritise TTFT — a user who sees the first tokens within two seconds perceives the system as "fast" even if the full generation takes longer. Streaming responses (SSE or WebSocket) further improves this perception without changing actual throughput.
When comparing serving solutions: SGLang achieves roughly one-fifth lower TTFT than vLLM on prefix-heavy workloads at the same hardware, which is a noticeable difference for interactive applications. For batch serving the gap is smaller.
Prefix caching and KV-cache sharing
When multiple requests begin with the same system prompt — which is common in production applications — each request recomputes the same prefix from scratch. That is wasteful: running the prefill for the same 500-token system prompt 1,000 times a day.
Prefix caching (or prompt caching) addresses this: the server stores the KV results of a prefix, and when the next request has the same opening it simply reads them from cache instead of recomputing. The effect is twofold — it reduces both latency (TTFT for a cached prefix is near zero) and compute costs.
SGLang implements this via RadixAttention — an LRU cache of KV values organised into a radix tree, where the longest common prefix across requests is automatically found and shared. On workloads with long repeated prefixes (RAG with the same context, a multi-turn chatbot with the same system prompt) the throughput improvement is measurable.
On the cloud API side, providers implement an analogous feature: automatic prompt caching reduces the cost of input tokens on repeated prefixes by roughly 50–90%. This is relevant even for hybrid architectures where some requests go to the cloud — structuring prompts correctly (constant system prompt first, dynamic content at the end) can substantially reduce your bill. The article Prompt caching and cost covers this in detail.
How to get more out of a single GPU without buying hardware
A summary of the practical levers we see in deployments — in order of typical impact:
- 1.Switch to a production serving framework — if you are running
Ollamawith multiple concurrent users, moving tovLLMorSGLangis the single most impactful change you can make. Continuous batching and PagedAttention are built in.
- 1.Set the right quantisation — Q4_K_M or AWQ 4-bit significantly reduces the model's memory footprint compared to FP16 (typically to a third or a quarter), freeing VRAM for larger batches, with quality loss below 5–8%. More on formats in the article LLM Quantisation (GGUF, AWQ, GPTQ).
- 1.Enable KV-cache quantisation — FP8 or INT8 KV-cache halves the memory footprint with minimal impact on outputs. In
vLLMthis is the--kv-cache-dtypeparameter.
- 1.Optimise for your context — if your requests on average use only 20% of the configured maximum context, reducing
--max-model-lenfrees VRAM for more parallel requests.
- 1.Use prefix caching — if your system prompt or RAG context is identical across requests, route them to a framework with RadixAttention (SGLang) or enable prefix caching in
vLLM.
- 1.Monitor real GPU utilisation —
nvidia-smi dmonand the serving framework's own metrics will show where the bottleneck is. Low GPU compute utilisation combined with high latency typically signals a KV-cache management problem or suboptimal batching.
When multiple GPUs make sense
Horizontal scaling (tensor parallelism across multiple GPUs) is justified in two situations: the model physically does not fit on a single GPU even after quantisation, or you have exhausted all single-GPU optimisations and throughput is still insufficient.
Tensor parallelism (--tensor-parallel-size 2 in vLLM) splits the model's matrices across GPUs — each GPU processes part of the computation and results are synchronised. It scales nearly linearly for the prefill phase, less efficiently for decode (due to synchronisation overhead).
For most enterprise deployments running one model with dozens of concurrent users, a well-configured single GPU with the right serving framework outperforms two GPUs running Ollama — both in throughput and economics. Buying hardware before optimising software is a common but unnecessarily expensive mistake.
For guidance on which GPU to choose and how much VRAM you need for a given model and workload, see the practical reference in the article Which GPU for LLM Inference.
Frequently asked questions
What is TTFT and why does it matter?
TTFT (Time to First Token) is the time from sending a request to when the server begins streaming the first tokens of the response. For interactive applications (chatbots, copilots) TTFT is the decisive factor for perceived responsiveness — a user who receives the first token within 1–2 seconds perceives the system as responsive even if the complete response takes 10 seconds. Long context and a full KV-cache increase TTFT because the prefill phase takes longer.
What is the difference between PagedAttention and continuous batching?
They are two orthogonal optimisations that solve different problems. PagedAttention addresses memory management — it reduces KV-cache fragmentation by allocating it dynamically in pages rather than reserving capacity upfront. Continuous batching addresses request scheduling — instead of waiting for a batch to finish, it adds new requests to the current run on an ongoing basis. Both work together and are implemented in production frameworks such as vLLM.
Is prefix caching worthwhile for smaller deployments?
Yes, if you have a consistent system prompt or RAG context that repeats across requests. Even with a dozen concurrent users, prefix caching can reduce TTFT by 40–70% for requests with a long cached prefix. On the cloud API side it directly affects costs — a correctly structured prompt with a constant prefix can save tens of percent on input tokens.
When does it make sense to move from Ollama to vLLM?
When you have more than one concurrent user in a production environment. Ollama is excellent for developer desktops, prototyping, and single-user local access. With multiple parallel requests, continuous batching and PagedAttention in vLLM will deliver substantially higher throughput for the same hardware cost. The migration is relatively straightforward — vLLM provides an OpenAI-compatible API, so changing the URL and key is usually sufficient.
How does context length affect throughput?
Linearly and significantly. A longer context means a larger KV-cache per request, which directly reduces the number of requests that fit into available VRAM at once. A model with a 1M context window requires on the order of a hundred gigabytes of KV-cache for a single long request — unachievable on typical production GPUs. For most real-world use cases, RAG with a shorter context is substantially more cost-effective than a long context window filled with an entire document. More on the decision between RAG and long context in the article Context window — when 1M tokens actually helps.
*MP Industrial Solutions helps companies design LLM serving architectures suited to their workload and budget — from choosing a serving framework to configuring KV-cache and quantisation. If you are dealing with a slowdown in an existing deployment or planning a production rollout, we are happy to look at your specific numbers and propose optimisations.*
