In most production LLM applications, the system prompt, contextual documentation, or conversation history makes up the bulk of the input text on every API call. If your system prompt is a thousand tokens long and you serve ten thousand requests a day, you're paying for ten million tokens — most of which is the exact same text every single time. Prompt caching solves this at the infrastructure level: the provider stores the KV representation of recurring prefixes and reuses it on subsequent calls instead of recomputing it.
This article explains how prompt caching works under the hood, under what conditions it meaningfully reduces both cost and latency, what specifically to cache, and when caching won't help — or will actually add cost. It also covers how to design prompts that cache well, a detail many teams overlook.
How prompt caching works
When you send a request to an LLM API, the model computes what's called a KV cache (key-value vectors) for every input token — this is the foundation of the self-attention mechanism through which the model evaluates relationships between tokens. That computation is not free: it's GPU work that directly affects both cost and time-to-first-token (TTFT).
Prompt caching (also called context caching or prefix caching) works as follows: if the beginning of a prompt — the prefix — is the same across multiple calls, the provider computes its KV representation once, stores it on the server, and on subsequent requests simply reads it back instead of recomputing it. From the caller's perspective this is fully transparent — responses are just as correct, but cheaper and faster.
What this looks like in numbers:
- Anthropic: cache read = 10 % of the standard input-token price (90 % saving); cache write = 125–200 % of the standard input-token price (slightly more expensive on first write)
- OpenAI: automatic, no code required; cache read = 50 % of the standard input-token price
- Real-world production savings: typically 50–70 % of total input-token costs for workloads with long, recurring prefixes
The economics are straightforward: you pay a little more on the first call (cache write), but every subsequent read of the same prefix is substantially cheaper. You break even on the second read; at thousands of calls per day, the savings are significant.
TTL and prefix consistency
A cached prefix doesn't last forever. Anthropic offers a TTL (time-to-live) of either 5 minutes or 1 hour — after expiry the cache is invalidated and the prefix is recomputed on the next call (another cache write). OpenAI manages TTL automatically with no explicit configuration.
Two practical consequences follow. First: if your workload produces requests with the same prefix in quick succession (seconds or minutes apart), caching works well. If requests are spread across the day with hour-long gaps and you're using a 5-minute TTL, the cache will invalidate regularly. Second: the prefix must be byte-for-byte identical. Even a small change — a different space, a differently formatted date in the system prompt — invalidates the entire cached representation and triggers a new cache write.
What makes sense to cache
Not every piece of text in a prompt is worth caching. The selection follows simple logic: cache what is long, static, and repeatedly used.
System prompt. The most obvious candidate. If you have 500–2,000 tokens of system instructions (role, context, behavioral rules, few-shot examples), these rarely or never change across most applications. Place them as a fixed prefix at the start of the prompt — caching will cover them automatically.
Shared documents and standards. If every request includes the same technical manual, ISO standard, company policy, or legal document as reference context, that's an ideal candidate. We've seen deployments where a 300-page technical manual was injected into every call to a customer chatbot — caching cut costs there by more than 80 %.
Long conversation history. In multi-turn chats, the application sends the entire previous history with each new request. The older portion of the history never changes — it's a growing, stable prefix. Caching works well here as long as the application formats the conversation consistently and doesn't reorder messages.
Few-shot examples. If your system prompt includes several long input–output examples (few-shot prompting), those are also static and repeatable. Include them in the cached prefix.
What is not worth caching:
- Short or one-off prompts — you'll pay extra for the cache write and never recover it
- Prompts where every request differs from the very beginning (e.g., generating variable reports with no shared prefix)
- Systems with inconsistent formatting — even a random reordering of parameters in the system prompt is enough to invalidate the cache
How to design prompts for caching
One of the most common oversights we encounter when reviewing existing applications: teams enable caching, but the prompt structure isn't designed to use it effectively.
The fundamental rule: static before dynamic. Caching covers the prefix — the start of the prompt. Therefore everything that changes per request (the user's specific question, variable parameters, the current date) must go at the end of the prompt. If you insert a dynamic element in the middle of static context, the prefix changes every time and caching never fires.
Example of the wrong order:
System prompt (static)
↓
Current date and username (dynamic — changes with every request)
↓
Technical manual (static, 800 tokens)
↓
User question (dynamic)In this example the dynamic element in the middle invalidates caching for the entire technical manual below it.
Correct order:
System prompt (static)
↓
Technical manual (static, 800 tokens)
↓
Current date and username (dynamic)
↓
User question (dynamic)The static block is now a contiguous prefix — the provider caches it and the dynamic elements don't affect it.
Stable formatting. If your system prompt is generated by code, avoid embedding variable values (timestamps, version strings) directly in the static portion. Instead of Current system version: 2.4.1 (deployed 2026-03-20), use Current system version: {version} and inject the value in the dynamic section — or better yet, in the user message.
On-premises deployments (vLLM, SGLang) follow the same principles. Both vLLM and SGLang implement prefix caching in their own way — SGLang via RadixAttention (an LRU tree of KV caches), vLLM via PagedAttention with prefix-aware block management. Both frameworks automatically recognise recurring prefixes and skip recomputation. The details differ, but the design principle — static before dynamic — applies equally. For more on choosing a serving framework, see vLLM vs SGLang vs Ollama.
When caching helps significantly — and when it doesn't
A simple test: calculate what percentage of your input tokens are a repeating static prefix. If it's above 50 %, caching is likely a worthwhile investment. If every request is unique, caching won't help.
High-saving scenarios:
- RAG with long context. Applications that inject the same shared documentation on every call (company policies, manuals, standards). The cached context is retrieved at 10 % of the usual price.
- Customer support with long system prompts. A typical production customer chatbot has a system prompt of 500–1,500 tokens describing the product, rules, tone, and examples. At thousands of calls per day, the savings are substantial.
- Multi-turn agents with accumulating history. Conversational sessions where history is written into context. The older portion of the history becomes a stable prefix — caching grows organically with conversation length. This is directly related to managing AI agent costs in production.
- Batch document processing with shared instructions. If you process thousands of invoices, service records, or orders with the same extraction instructions, the instructions are cached and you pay only for the documents themselves.
Low or zero savings scenarios:
- Generative one-shot tasks. Write an annotation for this video, translate this text, generate a report — every request is unique even in the system portion. You'll pay extra for the cache write and never recover it.
- Low-volume workloads. If your application makes a hundred calls per day, the absolute savings are small. The implementation overhead (prompt refactoring, hit-rate monitoring) may not pay off.
- Short system prompts. If your system prompt is 50 tokens, you don't even meet the minimum for caching (Anthropic requires a prefix of at least 1,024 tokens).
Measuring caching effectiveness
Deploying without measuring is navigating blind. Three metrics to track:
Cache hit rate. The ratio of requests where the prefix was loaded from cache versus total requests. Anthropic's API returns cache_read_input_tokens and cache_creation_input_tokens in the response. OpenAI returns cached_tokens in the usage object. Target: with correctly structured prompts, hit rate should be 80–95 % under a stable workload.
Average input tokens per request. When caching is working, the effective cost of input tokens falls — cache-read tokens are billed at a lower rate. Track the rolling average of input cost per request over time.
TTFT (Time to First Token). A cached prefix eliminates recomputation, which shortens TTFT. If you see a bimodal latency distribution (some requests faster, others slower), that's likely the result of the difference between cache hits and cache misses.
Incorporate these metrics into your observability pipeline — for example via structured logs or tracking in agent observability and tracing.
Comparing cloud providers
Approaches also differ in implementation details, which influences how you design for caching.
Anthropic (Claude): Explicit — in the request you must mark the blocks you want cached using the cache_control parameter. This gives you control over exactly what is cached. TTL of 5 minutes or 1 hour (configurable). Minimum for caching: 1,024 tokens. Cache write is 125–200 % of the standard input-token price; cache read is 10 %.
OpenAI (GPT): Implicit — automatic, no code required. The provider detects recurring prefixes and caches them on its own. Cache read = 50 % of the input-token price. Simpler to enable, less control over what is cached.
Google (Gemini): Explicit context caching via API, similar to Anthropic. Longer TTL — hours to days. Well-suited for very stable, long contexts (complete manuals, entire codebases).
On-premises (vLLM, SGLang): Prefix caching is built in and runs automatically for recurring prefixes. No extra charge for writes — compute costs are yours, but for repeated prefixes the serving framework saves GPU time, which increases throughput and reduces latency. This advantage is key in private deployments on your own hardware.
Frequently asked questions
Is prompt caching secure — does my data stay on the provider's servers?
Yes, the cached KV representation stays on the provider's servers for the duration of the TTL window. For most companies this is not an issue — API calls go there anyway. For organisations with strict data-residency requirements (hospitals, banks, regulated industries), on-premises serving with vLLM or SGLang is the better choice, where the cache remains fully under your control. More in on-premises LLM for regulated industries.
Can caching negatively affect response quality?
No — caching only affects how quickly the KV vectors for the prefix reach memory, not what values those vectors contain. The model receives identical context whether the prefix was freshly computed or loaded from cache. Responses are the same.
Is caching worth it for a small application with a few hundred calls per day?
It depends on system-prompt length and model price. For a 1,000-token system prompt and 500 calls per day with a frontier model, monthly savings are in the low single-digit euros — right on the boundary of whether the refactoring effort pays off. At 50,000 calls per day with a 2,000-token prompt, savings are in the hundreds of euros per month and above — at that scale the case is clear.
What if my system prompt needs to be updated every hour (for example, I'm injecting the current time or inventory status)?
That's the classic trap. If the dynamic value is inside the cached block, every change invalidates the cache. The fix: pull the dynamic value out of the cached prefix and put it in the user message or in an uncached block at the end of the prompt. The cached system prompt stays unchanged; the dynamic value sits alongside it.
Does prompt caching work with structured output (JSON mode)?
Yes. Caching is independent of output format. The cached prefix can be a combination of system instructions and a JSON schema, as long as it's static — the model generates against it and the cache hit rate is unaffected. More on structured output in structured outputs and JSON mode.
Conclusion
Prompt caching is not a revolution — it's an infrastructure optimisation that works quietly and reliably, provided you design your application to take advantage of it. Three things are key: identify the static prefix in your prompts, place it consistently at the beginning, and measure the hit rate. For workloads with thousands of daily calls and long shared contexts, this change can reduce your LLM API bill by tens of percent without touching any application logic.
*At MP Industrial Solutions, when reviewing clients' LLM applications, we regularly encounter prompt structures where caching isn't being used — or where a dynamic element in the middle of static context kills the entire effect. If you'd like us to look at your application and identify concrete savings, we'd be happy to meet for a consultation.*
