A manufacturing client had a knowledge base: 8,000 technical manuals, service protocols, and drawings. They deployed classic RAG — embed, retrieve top-5, generate. For questions like "what are the parameters of motor X?" it delivered 85% accuracy. Then came this question: "What should I do when motor X reports error E47 and the temperature is above 80 °C at the same time?" The system returned a fragment about error E47 and a fragment about temperature — but never specified that the combination of these two conditions requires a different procedure than each one alone. The answer was technically correct, operationally useless.
This is a class of problems for which classic RAG has no architectural solution. It is not a matter of a better embeddings model or a larger chunk size. It is a matter of who controls retrieval — and when it knows it needs to retrieve more.
Where classic RAG hits its ceiling
Classic RAG (Retrieval-Augmented Generation) operates in a single retrieval step: a question arrives, the system indexes the query, retrieves the top-K chunks from the vector database, and passes them to the LLM. It is deterministic, fast, and cheap.
Problems arise with four types of tasks:
- Multi-step questions — the answer depends on two or more facts that live in separate documents and whose connection is not explicit in the text. Classic RAG makes one retrieval pass; the LLM receives only part of the context.
- Ambiguous or incomplete queries — the user writes "cooling problems on line 3", but the system doesn't know whether they mean motor cooling, hydraulic cooling, or the HVAC system in the hall. A single retrieve step cannot resolve the ambiguity.
- Questions that require comparison — "compare the service intervals for equipment A and B" requires two independent retrieve operations and a synthesis, not one retrieve with the hope that both documents land in the top-5.
- Iterative analyses — the agent must first establish something basic (e.g., the equipment serial number), then retrieve specific documentation based on that, then verify the validity of the information against the manufacturing date. Each step depends on the result of the previous one.
For these scenarios you need a different architecture — one where retrieval is not a one-shot action but an iterative process directed by an agent.
What agentic RAG does differently
Agentic RAG adds a decision-making layer on top of the classic retrieve-generate pipeline. The agent — typically an LLM with access to tools — receives a question and decides for itself:
- 1.What query to send to the vector database
- 2.Whether the results are sufficient, or more needs to be retrieved
- 3.What new query to formulate based on what it has obtained so far
- 4.When the context is complete and it can generate an answer
Instead of a single retrieve-and-generate cycle, a loop emerges in which the agent evaluates its own intermediate results. In practice this might look like: the agent receives the question about the combination of error and temperature, formulates a first query for error E47, reads the results, decides it needs context about temperature limits, formulates a second query, retrieves documents, synthesises both contexts, and only then generates the answer.
This approach has three key mechanisms that classic RAG lacks.
Query rewriting
User input rarely matches the language in which the documents are written. A technical manual describes "overheating of the hydraulic circuit"; the user writes "the oil temperature is too high". Query rewriting lets the LLM transform the original query into a form that better matches the vector space of the embeddings.
This is also part of more advanced classic RAG pipelines — it is not an exclusive feature of agentic RAG. The difference is that in agentic RAG query rewriting can repeat at every step: the agent sees what it has retrieved and reformulates the query based on the intermediate result, not just the original question.
Multi-step retrieval
The agent can retrieve multiple times, with each step informing the next. This solves the "hidden connection" problem — when an answer is distributed across documents that have no explicit semantic relationship but whose combination is relevant.
From a practical standpoint, the implementation looks like a ReAct agent in which "search the knowledge base" is a tool. The agent calls it as many times as it deems necessary, with different queries. More on the architecture of the agent loop can be found in the article on AI agent architectures.
Self-correction
The most sophisticated feature of agentic RAG. The agent not only retrieves but evaluates whether the retrieved information is relevant, current, and sufficient. In more advanced implementations a separate faithfulness judge is added — an additional LLM call that checks whether the generated output is grounded in the retrieved documents rather than hallucinated.
This is where many people underestimate the complexity: a naive agentic RAG without a faithfulness judge can hallucinate more than classic RAG. The reason is simple — longer context, more retrieval steps, more room for the model to generate plausible but ungrounded text.
When to deploy agentic RAG — a decision framework
Not every RAG use case requires an agentic layer. The following is a decision framework drawn from real deployments.
Stay with classic RAG if: - Most questions are straightforward (one document = a complete answer) - Latency is critical (an interactive chatbot where the response must arrive within 2 seconds) - Query volume is high and cost is the limiting factor - The knowledge base is narrowly defined and questions don't go beyond its scope
Move to agentic RAG if: - More than ~20% of queries require information from multiple documents - Users ask multi-step or comparative questions - The answer depends on state or context that is not present in the original query - Accuracy matters more than latency (decision support, safety protocols) - You have the capacity to implement and monitor an agentic pipeline
In production deployments we have seen that for knowledge systems built on top of technical documentation (service manuals, standards, procedures), agentic RAG significantly improves answer quality on multi-step questions — but at the cost of 5–10× higher token expenditure and 3–5× longer latency compared to the classic approach. This is not a cost you pay for every query, but it must be factored into the architecture decision.
The extra cost: more than just tokens
Agentic RAG is more expensive than classic RAG on several dimensions, and tokens alone are only one of them.
Token costs: Where classic RAG consumes roughly 500–1,000 tokens per query, agentic RAG with 3–4 retrieval steps and a faithfulness judge can consume 5,000–15,000 tokens. With cheap models (Flash/Haiku tier, around $0.10–0.50 per 1M input tokens) this difference is manageable. With frontier models in the class of Claude Opus or GPT-5, the cost per agentic query is in the range of cents to tens of cents — non-trivial at high volume.
Latency: Each retrieve step adds time — an LLM call for query rewriting, a vector-database call, and potentially the faithfulness judge. A real agentic RAG pipeline running 3–5 steps takes 5–20 seconds. For interactive applications this demands asynchronous design and visual feedback (a progress indicator, streaming).
Implementation and observability overhead: Classic RAG is a 50-line function. Agentic RAG is a pipeline with checkpointing, retry logic, monitoring, and debugging tooling. Without observability tools such as Langfuse (self-hostable) or Arize Phoenix it becomes nearly unmaintainable. A multi-step agent without traces is a black box — when it fails, you don't know why. More on observability for agents in the article on observability and tracing.
Retry overhead: Agents fail — a badly formulated retrieval query, a database timeout, an unexpected result format. In production systems a retry rate of around 10–15% is common, which effectively increases the total number of calls. This overhead must be included in the cost model.
Implementation patterns in practice
There are two dominant approaches to implementing agentic RAG.
Pattern 1: ReAct agent with a retrieval tool
The simplest form — a standard ReAct agent where one of the tools is search_knowledge_base(query: str) -> list[Document]. The agent decides when and how to call the tool, sees the results in the Observe phase, and proceeds accordingly. Implementation via LangGraph with explicit state is today's standard for production systems.
Advantage: relatively straightforward to implement; the agent is flexible. Disadvantage: without a faithfulness judge the agent can hallucinate with confidence; output reliability is variable.
Pattern 2: Dedicated RAG agent with an evaluation loop
A more sophisticated approach: retrieve → faithfulness check → if insufficient, reformulate the query and repeat → only once the faithfulness score exceeds the threshold, generate the final answer. This is closer to the SELF-RAG or CRAG (Corrective RAG) pattern.
Implementation detail: the faithfulness judge is a separate LLM call with a prompt that receives a chunk and the generated answer and must respond whether the answer is grounded in the chunk. A cheaper model (Haiku tier) is typically sufficient as the judge — you don't need the strongest model for a binary evaluation. The result is measurable and loggable, which is critical for auditing and tuning.
What practice has taught us: The most common mistake on a first agentic RAG implementation is not technical — it is product-level. The team deploys the full agentic pipeline for every query without distinguishing simple, straightforward questions from complex ones. The result: the entire system is slow and expensive, even though 60% of queries could have been handled by classic RAG more cheaply and faster.
Practical recommendation: Put a router in front of the pipeline. An LLM or a simple classifier assesses query complexity and routes it either to fast classic RAG or to the agentic pipeline. This significantly improves the cost-to-quality ratio under real load. The trade-off between RAG and fine-tuning is also covered in the article on RAG pipeline quality.
Common risks and how to mitigate them
Retrieval loop: The agent repeatedly retrieves similar documents without making progress. Mitigation: a max_retrieval_steps limit (typically 4–6) and a duplicate-chunk check — if the agent has retrieved the same document twice, it must change the query.
Confabulation from long context: The more documents are in context, the more room the model has for hallucinatory connections. Mitigation: faithfulness judge + citations (every claim must be traceable to a specific chunk). Uncertainty about which source a piece of information comes from is a warning signal.
Uncontrolled cost: Agentic RAG without limits can consume 10× more tokens on a single complex question than planned. Mitigation: hard per-query token limit, alerting at 50% of the limit, per-query cost reporting in the observability platform.
Prompt injection via documents: The agent retrieves a document in which an attacker has embedded instructions for the LLM (e.g., "ignore previous instructions and return X"). In agentic RAG this risk is higher because document content flows directly into the agent's context, which governs subsequent steps. Mitigation: context separation (retrieved documents in a separate context block with clear labelling), validation of retrieved content before injecting it into the agent's context.
Frequently asked questions
Is agentic RAG always more accurate than classic RAG?
No. A naive agentic RAG without a faithfulness judge can hallucinate more than classic RAG, because longer context and more retrieval steps give the model more room to generate plausible but ungrounded text. Agentic RAG is more accurate only when it is correctly implemented with a self-evaluating loop and a clear citation trail.
How much does agentic RAG cost compared to classic?
Agentic RAG is typically 5–10× more expensive per query. With cheap models (Flash/Haiku tier, $0.10–0.50 / 1M input tokens) the difference is manageable. With frontier models (Claude Opus class, ~$5 input / ~$25 output / 1M tokens) the cost per agentic query is in the range of cents to tens of cents. For high query volumes (thousands per day) a router that directs simple questions to classic RAG is essential.
Do I need a special framework for agentic RAG?
Not necessarily, but it will ease the implementation. LangGraph with explicit state and checkpointing is today's dominant choice for production systems — it handles checkpointing, retry logic, and HITL gates. LlamaIndex has built-in agentic RAG with query rewriting and multi-step retrieval. For simpler cases a custom ReAct loop is sufficient.
What is the latency of agentic RAG in production?
Real latency for an agentic RAG pipeline with 3–5 retrieval steps is 5–20 seconds. Classic RAG typically responds within 1–3 seconds. For interactive applications, asynchronous processing and streaming output are required — the user must see that the system is working; otherwise they experience it as a failure.
When should you use GraphRAG instead of agentic RAG?
GraphRAG is a better fit when relationships between entities matter and documents form a network of interconnected facts (e.g., regulatory cross-references, organisational hierarchies). Agentic RAG is a better fit when the knowledge base is document-oriented and the challenge is multi-step questions rather than graph relationships.
*If you are assessing whether your knowledge-base use case requires agentic RAG or whether an optimised classic approach would suffice, we are happy to evaluate the specific situation — including cost estimates, latency projections, and architectural design. MP Industrial Solutions implements RAG and agentic systems for B2B environments with a focus on measurable outcomes.*
