When model providers announced context windows in the millions of tokens, many CTOs and lead engineers reacted the same way: "Great — we'll load the entire documentation in and skip RAG entirely." It's an understandable reaction. Fewer components, less infrastructure, less tuning. The reality in production, however, is more complicated.
A large context window is a powerful tool — but it has physical and economic limits. This article explains where those limits lie, shows when 1M tokens genuinely solves a problem, and conversely, when using it is unnecessarily expensive and sometimes less accurate than a well-designed pipeline with agentic RAG.
What "1M tokens" actually means
The context window is the maximum length of text — input plus output — that the model sees at once in a single call. A token corresponds to roughly 0.75 words in English; in Slovak and Slovak-named technical texts, tokens tend to be slightly longer.
One million tokens is approximately 750,000 words — an entire average novel, hundreds of pages of technical documentation, or the complete source code of a medium-sized project. Today this is not a premium feature: Claude (higher tiers), Gemini 2.5/3.x, and Llama 4 Maverick offer 1M, and Llama 4 Scout goes as far as 10M tokens. Not every model is this "long", though — DeepSeek V3, for instance, has 128K. Even so, a 1M context is no longer an exceptional feature; it is an emerging standard.
The question is not whether a model can handle a long context. The question is what it costs and how response quality degrades as context grows.
How cost scales with context length
The computational cost of inference does not scale linearly with context — the reality is more nuanced, but for practical decision-making it is enough to understand one mechanism: KV cache.
Every token in the context requires storing so-called key-value vectors in GPU memory (VRAM). With every newly generated token, the model accesses the entire cache. For a 7B model, the KV cache at 128K tokens is roughly 16 GB; at 1M tokens it exceeds 100 GB. For a 70B model at 128K tokens, the KV cache alone is around 40 GB — and that is before the model weights.
The practical result: if you want to serve multiple parallel requests with a long context, you need either substantial hardware or you will queue requests. On a cloud API this simply means a higher price per call — frontier models charge per input token, so a long context directly translates into higher costs. In an on-premises deployment it means needing more or more powerful GPUs.
An important optimisation exists: prompt caching. If your system prompt, documentation, or other static content remains the same across multiple calls, the provider stores its KV representation, and on subsequent reads you pay only 10% of the input-token price (at Anthropic). A cache write is conversely slightly more expensive — 125 to 200% of the standard price. Savings materialise only when you read the same prefix repeatedly within a short time window. We cover this strategy in more depth in the article prompt caching and cost.
"Lost in the middle" — when the model loses the thread
Most engineers are aware of the hard technical limit of the context window. Less discussed is another, subtler boundary: context rot — the degradation of response quality as context length grows.
Research shows that models tend to concentrate attention on the beginning and end of the context. Information stored in the middle of a long document is represented less accurately in responses — a phenomenon called "lost in the middle." The needle-in-a-haystack benchmark (finding a single sentence hidden in a long text) measures only the ability to retrieve an isolated fact. Real production tasks are more complex — they require connecting multiple pieces of information from different parts of a document and reasoning across several steps. On such tasks, quality degradation sets in earlier and is more pronounced.
In practice this means: if you have a 500-page technical manual and ask about the dependency between section 3 and section 47, a model with the entire book in context does not necessarily answer better than a model that was handed only the two relevant sections via RAG. And the second approach is substantially cheaper.
When a large context genuinely helps
Despite these limitations, there are scenarios where a long context is objectively the better choice. The main categories:
Whole-document analysis with sequential dependencies. If you want to summarise the minutes of a three-hour meeting, extract decisions from a lengthy legal filing, or analyse a long technical report where the conclusion depends on context introduced in the introduction — having the entire document in context makes sense. Here, linearity is an advantage, not a weakness.
Long conversations with full history. Multi-turn chat systems where every question depends on the previous answer — customer support with history, technical assistance where a single solution is refined across dozens of exchanges. Keeping the entire history in context is simpler than implementing sophisticated memory management.
Code review or refactoring of an entire repository. When you need the model to fix a dependency across multiple files simultaneously, the context of the full codebase eliminates inconsistencies. This is exactly how modern coding agents work — Cursor, Claude Code, and similar tools.
A single long, infrequent query. If you upload a 300-page contract once a month and ask about liability clauses, the economics work. One call, a clear output, low frequency — the cost is acceptable.
When RAG is cheaper and more accurate
Most production LLM systems in companies work differently: a large knowledge base (technical documentation, ISO standards, internal regulations, service record histories) queried by tens or hundreds of employees every day. Here, a large context without RAG stops making economic sense.
Situations where RAG is the better choice:
- High call volume with overlapping sources. If a hundred operators query the same manual with different questions every day, RAG retrieves only the relevant passages and you pay for a fraction of the tokens.
- The database is growing. The context window is fixed — if your documentation runs to 10 million tokens, even a 1M window is not enough. RAG scales with the size of the knowledge base without changing the architecture.
- You need precise source attribution. RAG returns specific chunks with references. A long context gives you an answer, but without attribution to a particular paragraph — which matters in regulated environments.
- Low latency is critical. Time to first token (TTFT) grows with context length. At 128K tokens, latency is an order of magnitude higher than with a 4K context of relevant RAG chunks.
A more detailed comparison of both approaches can be found in the article RAG vs fine-tuning — decision guide, where we also cover combined scenarios.
The hybrid approach: context + RAG
In practice, the best production systems do not choose between "full context" and "pure RAG" — they use both depending on the situation. A typical architecture:
- 1.System prompt and global instructions — static, cached via prompt caching, cheap on repeated reads.
- 2.Dynamically retrieved chunks — relevant passages from documentation, added to the context based on the query. Typically 5–20 chunks, each 200–500 tokens.
- 3.Conversation history — the last N exchanges, with a summary of older history when needed.
- 4.Entire document in context — only when the document is short (under 50K tokens) and the task requires the whole thing.
This combination keeps the effective context size at 8–32K tokens for typical queries, which is an order of magnitude cheaper than 500K+ with a naïve "load everything" approach.
Practical decision-making: questions before deployment
Before committing to an architecture, answer these questions:
- How many times per day will the same knowledge base be used? If more than 50 times, RAG pays off economically even with higher setup costs.
- Is the base larger than 500K tokens? If yes, a long context is insufficient and RAG is the only realistic option.
- Does the answer depend on reading the entire document sequentially? If yes, a long context is a legitimate choice.
- What is the latency tolerance? A real-time conversation requiring a response under 2 seconds and a long context are hard to combine.
- Do you need citable sources? RAG gives you a chunk with a reference; a large context does not.
If you answer mostly "no, no, yes, high, no" — a long context may be the right choice. If the opposite prevails, a RAG or hybrid architecture will be better both economically and in quality.
Prompt caching as a middle ground
A special case arises when you have a relatively large but static context — for example a company style guide, an extensive system prompt with instructions, or reference documentation. Here, prompt caching offers an attractive middle ground:
- First call: you pay the full price for a cache write (125–200% of the standard input price).
- Every subsequent call within the TTL window (5 minutes or 1 hour at Anthropic): you pay 10% of the input-token price for the cached portion.
- With a static prefix repeated ten times, total costs are substantially lower than without caching.
Typical savings in production on a suitable workload reach 50–70% of input-token costs. Emphasis on "suitable workload" — if every user arrives with a completely different long prompt, the cache hit rate will be close to zero and the cache writes will instead increase costs.
What this means for hardware and serving framework selection
If you decide to use genuinely long contexts in an on-premises deployment, it is important to understand the hardware implications.
KV cache for a long context requires significantly more VRAM on top of the model weights themselves. For a 70B model at 128K context, the KV cache alone is around 40 GB — add to that approximately 140–168 GB for model weights in FP16 (or roughly 35–40 GB for Q4_K_M quantisation). This quickly outgrows a single GPU's capacity and requires a multi-GPU setup or tensor parallelism.
On the serving framework side: vLLM and SGLang implement PagedAttention and RadixAttention — techniques that reduce KV cache fragmentation and enable more efficient sharing across concurrently processed requests. Ollama is excellent for desktop use and experimentation, but for production multi-user serving with long contexts its throughput is an order of magnitude lower. More on framework comparisons can be found in vLLM vs SGLang vs Ollama.
Grouped Query Attention (GQA), implemented by most modern models, significantly reduces KV cache size compared to classic multi-head attention — when choosing a model for long-context workloads it is therefore important to verify GQA support.
Frequently asked questions
Will a million tokens replace RAG entirely?
For most production systems, no. The economics don't work: 1M tokens on every call is an order of magnitude more expensive than retrieving relevant chunks. In addition, quality degradation at very long context lengths is real. A large context makes sense for one-off analysis of long documents, not for a system where the same knowledge base is queried a hundred times a day.
How much does response quality drop with a long context?
It depends on the task. For simple single-fact lookup (needle-in-a-haystack), modern models are reliable even at 1M tokens. For more complex tasks — connecting information from multiple locations, multi-hop reasoning — accuracy drops more significantly as length increases. The "lost in the middle" phenomenon is experimentally documented and must be accounted for in architecture design.
Is prompt caching always beneficial?
No. It is beneficial when a static or rarely changing prefix is read repeatedly within the TTL window (5 minutes or 1 hour). Cache write tokens are more expensive than standard input tokens (1.25–2× at Anthropic). If every user brings a unique long context, or call frequency is low, the cache hit rate will be low and total costs may be higher than without caching.
What hardware do I need for local deployment with a long context?
It depends on the model and context length. As a rough guide: for a 70B model at 32K tokens you need around 50–80 GB of VRAM in total (weights + KV cache at Q4 quantisation). At 128K tokens the KV cache alone is roughly 40 GB (for 70B, FP16 KV). For a 7B model the situation is more favourable — at 128K tokens you can manage on a high-end GPU with 40–80 GB VRAM. Apple M-series with unified memory (up to 128–192 GB on M5 Ultra) is an interesting on-premises alternative for 70B with a long context.
Do the same principles apply to open-weight on-premises models and cloud APIs?
The economics differ; the principles are the same. On a cloud API you pay directly per token. On-premises you pay for hardware (CAPEX) and energy (OPEX) — but KV cache and therefore GPU memory requirements grow in exactly the same way. The "long context vs RAG" decision therefore rests not only on token price but on the overall hardware architecture, batch size, and latency.
*If you are designing an LLM system architecture and are unsure where the boundary between long context, prompt caching, and RAG lies for your use case, MP Industrial Solutions is happy to review your specific situation — we bring experience from industrial deployments where every euro of operating cost matters.*
