When we deployed our first agent for a heavy-industry client — the task was to automate the assembly of weekly reports from five systems — total token cost for the first month exceeded the original estimate by five times. Not because the agent was doing anything wrong. But because we had not thought through how many tokens each loop step consumes as history grows, as tool calls retry repeatedly, and as the entire conversation is appended to the context on every call. What cost cents per query in the prototype cost an order of magnitude more in production with dozens of users and thousands of daily tasks.
This article is not about hypothetical pricing. It is about where the money actually flows when you deploy an agent, why common estimates are so far from reality, and which optimisations work in practice — from shorter context to routing cheaper models to prompt caching.
Where an agent's token cost originates
Agents are more expensive than simple LLM calls for one straightforward reason: every loop step is a new API call, and the growing history enters the context each time. Compare this with classic RAG: one query, one call, one response. An agent with ten steps makes ten calls — and each one carries the entire preceding context.
The concrete cost topology looks like this:
- System prompt — repeated on every call. If your system prompt is 2,000 tokens and the agent takes 8 steps, you pay for that same text eight times.
- Conversation history — each Reason-Act-Observe cycle adds more tokens to the context. For long tasks, input tokens grow quadratically.
- Tool outputs — tool responses (database results, API responses, documents) are inserted into the context and increase its size. Extensive RAG retrieval before each step can double input tokens.
- Retry loops — if a tool call fails, the agent tries again. A retry rate of 10–15 % is common in production systems. Every retry is another call with the full context.
- Output tokens — agent reasoning (chain-of-thought, step-by-step thinking) generates long responses. Output tokens are typically 3–5× more expensive than input tokens.
The result: a real production agent consumes roughly 5–50× more tokens per task than you would estimate from a prototype.
Order-of-magnitude examples — what this means in euros
Model prices change every 1–3 months, so the figures below are indicative order-of-magnitude examples valid at the time of writing. Always verify current pricing directly with the provider.
Frontier models currently range from roughly $0.10 to $5 per million input tokens and from roughly $5 to $30 per million output tokens — a 50× spread between the cheapest and most expensive tier. That is the key number: model selection is the single most powerful cost lever, stronger than any other optimisation.
Consider an agent processing 100 tasks per day:
- Cheap tier (Flash/Haiku-equivalent, $0.10–0.30/1M input tokens): even at 50,000 tokens per task you are in the range of single-digit euros per day. A few hundred euros per month.
- Mid tier (Sonnet-equivalent, ~$3–5/1M input tokens): the same load can come to €1,000–3,000 per month.
- Premium tier (Opus-equivalent, ~$15+/1M input tokens): the same load can be 5–10× more expensive than the mid tier.
If the system additionally runs at a 12 % retry rate, you are effectively paying 1.12× per task — which adds up quickly at scale.
Human escalation is another hidden cost: if the agent fails on 5 % of tasks and each escalation to a human operator costs €2–10 (time), that cost can exceed the token cost itself.
Four places where costs can be cut effectively
1. Shorten context — systematically, not by accident
The biggest saving is not a cheaper model, but fewer tokens per call. Concrete steps:
- Compress the system prompt. Every redundant line in your system prompt is paid for hundreds of times a day. Audit regularly for what actually needs to be there.
- Summarise conversation history. Instead of inserting the full history into the context, let the agent create a summary after N steps.
LangGraphhas a built-in summarisation node for this. Some granularity is lost, but for most tasks that is acceptable. - Filter tool outputs. If RAG retrieval returns five documents but the agent needs only one, you are paying for four unnecessarily. Set up reranking and limit the number of fragments inserted into the context.
- Limit output length. Reasoning traces do not always need to be unconstrained. For standard steps a short answer is sufficient; leave detailed chain-of-thought only for genuinely complex decision points.
2. Route to cheaper models for simple steps
Not every agent step needs a frontier model. A typical distribution:
- Orchestration steps (what to do next, which tool to call): usually manageable by a cheaper model
- Data retrieval and filtering: cheap model or even deterministic logic
- Final response synthesis for the user: this is where a stronger model is worth it
- Verification and self-consistency checks: cheap model in a loop
LLM routing — dynamic model selection based on step complexity — is an optimisation we covered separately in our article on LLM routing and cascading. The same principle applies in an agentic context: you have a dispatcher that classifies each step and routes it to the cheapest model capable of handling it.
3. Prompt caching — pay for the system prompt once
Prompt caching is a technology offered in the APIs of Anthropic, OpenAI, and Google. The principle: if the beginning of the context (typically the system prompt or a long static document) repeats across calls, the provider caches it server-side and charges a fraction of the standard price for the repeated portion — roughly –90 % on the cached segment.
For an agent with a long system prompt running across dozens of concurrent sessions, caching is one of the fastest-to-implement savings. We cover this in more detail in our article on prompt caching and costs.
One condition: caching works well for stable parts of the prompt. If every call includes different dynamic content at the start, the cache hit rate will be low and savings minimal.
4. Set limits — before deployment, not after an incident
This is the point that technical literature mentions as a footnote, but in practice it is a law: an agent without hard limits is a financial risk.
max_steps(or equivalent): maximum number of loop steps before haltingmax_tokens_per_run: total token budget for a single tasktimeout: time after which the agent stops and returns a partial resultcost_budget_per_day: at the whole-system level, monitored with an alert
An agent without a max_steps limit can get stuck in a loop on an unexpected input or a failing tool dependency and generate thousands of calls overnight — with a bill in the thousands of euros. This is not hypothetical. We see it in production systems and it is always preventable.
Agentic RAG: where costs escalate quietly
Agentic RAG — RAG in which the agent itself decides how many times to retrieve and what to fetch — is 5–10× more expensive than classic RAG. Not because it is doing something wrong, but because it is doing more work.
A typical agentic RAG sequence for a single question:
- 1.Deciding which retrieval strategy to use (1 LLM call)
- 2.First search (retrieval + reranking)
- 3.Evaluating whether the results are sufficient (1 LLM call)
- 4.Optional additional search with a refined query (retrieval + reranking)
- 5.Synthesising the answer (1 LLM call with the full collected context)
Every step has a price. If you are also using a strong model for each of these calls, costs grow quickly.
Optimisation: use a cheap model for the decision steps (1, 3) and a strong model only for the final synthesis (5). And set a limit on the number of retrieval iterations — most questions are resolved in 1–2 rounds, not five.
More on the agentic RAG architecture in the article Agentic RAG — how it works in practice.
Batch API: when you do not need results immediately
If you have tasks that do not require a real-time response — offline document processing, overnight analytics runs, training data generation — the Batch API offers a significant saving. Anthropic (and other providers) offer –50 % on batch calls compared to the synchronous price, in exchange for processing with a delay of several hours.
Architecture: a front-end system collects tasks into a queue, a batch job runs overnight or during low-load periods, results are written to the database and are available by morning. For many industrial applications — report summarisation, documentation categorisation, PDF data extraction — this approach is financially an order of magnitude more advantageous.
Measuring costs: what to track in production
Token cost without observability is a black box. We recommend tracking at minimum these metrics:
- Cost per task type — different task types have different cost profiles. Without this breakdown you cannot know where to optimise.
- Retry rate per tool — which tool causes retries most often? Is it a tool error or a weak tool description?
- Average steps per run — if the average is 8 steps where you expected 4, you need to look at the architecture.
- Cache hit rate — if you have prompt caching, what percentage of calls actually hit the cache? Below 60 % means caching is configured poorly.
- Human escalation rate — what percentage of tasks end in escalation? If it is rising, the agent is probably failing on scenarios it was not designed for.
Tools such as Langfuse (self-hostable, framework-agnostic) or LangSmith (deep LangGraph integration) capture these metrics at the level of every node and call. Without observability you are flying blind — more on that in our article on AI agent observability and tracing.
Open-weight models as a cost alternative
For teams primarily concerned with cost, today's open-weight models are a real alternative. Families such as Qwen 3.x, DeepSeek, and Llama 4 achieve results on many agentic benchmarks comparable to frontier closed-source models — at zero or minimal API cost when run on-premises.
You pay for a GPU server instead of an API. For large task volumes (tens of thousands of daily calls) the break-even point is typically within months. For regulated environments where data cannot leave the company, on-premises is also a GDPR compliance requirement, not just a cost question.
Trade-off: on-premises requires engineering capacity to manage infrastructure, the serving stack (vLLM, SGLang), monitoring, and model upgrades. Not every team has that.
Frequently asked questions
Why is my agent far more expensive than I originally estimated?
The usual cause is a combination of three things: growing history in the context at each loop step, retry calls when tools fail, and long system prompts repeated on every call. A prototype typically tests 1–2 steps with static inputs. Production means real inputs, real tool and user errors, and users that drive the agent into unexpected states. We recommend profiling cost per run broken down into components (system prompt, history, tool outputs, output tokens) before you start optimising.
Is it worth implementing prompt caching?
If your system prompt is longer than 1,000 tokens and the agent runs across dozens or hundreds of sessions per day, yes — it is almost always worth it. Implementation is usually straightforward (a parameter on the API call), and the saving on the cached portion is roughly –90 % with Anthropic. The condition is a stable beginning to the context — if the system prompt changes on every call, cache hit rate will be low.
When does it make sense to switch from API to an on-premises model?
It depends on volume and compliance requirements. As a rule of thumb: if the monthly API bill exceeds the cost of a dedicated GPU server (including management and hardware amortisation) — typically somewhere in the range of several thousand euros per month — it is worth running the analysis. For regulated industries (healthcare, legal, finance) on-premises is also a GDPR compliance requirement, not just a cost question.
How should I set a sensible max_steps limit?
Start with profiling — how many steps does your typical task actually need? If the median is 4 steps and the 95th percentile is 8, set the limit to 12–15 with a safety margin. The limit should be high enough not to interrupt legitimate long tasks, but low enough to catch infinite loops. The combination of max_steps + timeout + a monitoring alert for runs exceeding 80 % of the limit is robust in practice.
What is the bigger cost — tokens or human escalation?
It depends on the escalation rate. If the agent escalates 1–2 % of tasks and human time costs €5 per incident, at 1,000 tasks per day that is €50–100/day in escalations — which can exceed the token cost of a cheaper model. For high-volume systems with a non-trivial escalation rate, reducing escalation (better agent training, clearer guardrails) is a better investment than saving on tokens.
*Agent costs in production are measurable and controllable — but only if we actually measure them. At MP Industrial Solutions we help companies design agentic systems with a clear cost profile from the outset, rather than optimising retrospectively after the first monthly bill comes as a surprise. If you are considering deploying an agent and want to build a realistic cost model before implementation, we would be glad to talk.*
