When we first deployed an AI agent for a manufacturing client (the task: automate report assembly from multiple systems), we naively launched a simple loop with a single LLM and a series of tools. It worked, as long as nothing deviated from the expected path. The moment one tool returned an unexpected output, the agent got stuck in a loop and spent 40 minutes generating tokens before we manually stopped it. The problem was not the model. The problem was the architecture, which was not designed for real-world conditions.
Today there are three dominant patterns for AI agent architectures: ReAct, Plan-and-Execute, and reflection. Each has its place; each has its limits. This article is a decision framework — when to choose which pattern, what to expect from it, and where it will get you into trouble.
Why architecture choice matters
Choosing an architecture is not just a technical detail. It directly affects four metrics that matter in production:
- Reliability — how the agent responds to unexpected tool outputs or errors
- Cost — how many tokens it consumes per task
- Latency — how quickly the user gets a result
- Debuggability — how easily you can find where and why the agent failed
An important clarification: ReAct, Plan-and-Execute, and Reflexion are academic patterns from 2022–2023 (Yao et al. 2022 for ReAct). LangGraph, AutoGen, CrewAI, and other frameworks *implement* these patterns — but they did not invent them. Framework choice is a secondary decision. The primary decision is pattern choice.
ReAct: the Reason-Act-Observe loop
ReAct (Reasoning + Acting) is the simplest and most widely used pattern. The agent works in a loop:
1. Reason: based on current state and history, the LLM "thinks" (generates text describing its reasoning)
2. Act: selects and calls a tool with specific parameters
3. Observe: receives the tool output and adds it to context
4. Repeat until the goal is reached or a final answer is produced
ReAct's strength is adaptability. If something changes mid-run — a tool returns an unexpected result, an API changes its format — the agent sees this in the Observe phase and can adjust course. Not according to a pre-written plan, but reactively.
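The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the `policy` callable stands in for the LLM, and the names `Step`, `AgentState`, `run_react`, and the `lookup` tool are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    thought: str          # Reason: free-text rationale (never verified here)
    tool: Optional[str]   # Act: tool name, or None to finish
    arg: str              # tool argument, or the final answer when tool is None


@dataclass
class AgentState:
    goal: str
    history: list = field(default_factory=list)  # (Step, observation) pairs


def run_react(goal: str, policy: Callable[[AgentState], Step],
              tools: dict, max_steps: int = 5) -> str:
    """Run a Reason-Act-Observe loop with a hard step limit."""
    state = AgentState(goal)
    for _ in range(max_steps):
        step = policy(state)                 # Reason: decide the next action
        if step.tool is None:                # the policy produced a final answer
            return step.arg
        try:
            observation = tools[step.tool](step.arg)   # Act
        except Exception as exc:
            # Tool errors become observations so the policy can react to them.
            observation = f"ERROR: {exc}"
        state.history.append((step, observation))      # Observe
    raise RuntimeError(f"no final answer after {max_steps} steps")


# Hypothetical scripted policy standing in for an LLM call.
def scripted_policy(state: AgentState) -> Step:
    if not state.history:
        return Step("I need the stock level.", "lookup", "widget")
    last_observation = state.history[-1][1]
    return Step("I have what I need.", None, f"Stock: {last_observation}")


tools = {"lookup": lambda item: {"widget": "42 units"}[item]}
answer = run_react("How many widgets are in stock?", scripted_policy, tools)
print(answer)  # Stock: 42 units
```

Feeding tool errors back as observations is what gives the loop its adaptability; the `max_steps` bound is what keeps that adaptability from turning into an endless run.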
Where ReAct works well
- Tasks with 2–4 steps and a clear goal (find X, process Y, write Z)
- Interactive tools where context changes in real time
- Prototyping — ReAct is the fastest to implement
- When you want the simplest possible audit log (every Reason-Act-Observe step is visible)
Where ReAct fails
- Long tasks with many steps: every call resends the entire history, so cumulative token cost grows roughly quadratically with the number of steps
- Parallel tasks — ReAct is sequential; it cannot execute multiple branches simultaneously
- Without a `max_steps` limit it is dangerous — that is not an exaggeration. An agent without a configured step limit can get stuck in a loop overnight and generate thousands of pointless API calls — and with them a bill in the thousands of euros. We see this in practice; it is not a hypothetical risk.
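One way to guard against runaway loops is a small budget object that every tool call must pass through. The sketch below is an assumption about how such a guard could look; `RunawayGuard`, `AgentLimitError`, and the default limits are hypothetical and not taken from any framework.

```python
from collections import Counter


class AgentLimitError(RuntimeError):
    """Raised when the agent exceeds one of its configured limits."""


class RunawayGuard:
    def __init__(self, max_steps: int = 10, max_repeats: int = 2,
                 max_tokens: int = 20_000):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0
        self.calls = Counter()   # how often each (tool, args) pair was called

    def check(self, tool: str, args: str, tokens_used: int) -> None:
        """Record one tool call and raise if any limit is exceeded."""
        self.steps += 1
        self.tokens += tokens_used
        self.calls[(tool, args)] += 1
        if self.steps > self.max_steps:
            raise AgentLimitError("step limit exceeded")
        if self.tokens > self.max_tokens:
            raise AgentLimitError("token budget exceeded")
        if self.calls[(tool, args)] > self.max_repeats:
            raise AgentLimitError(f"repeated call to {tool}({args})")


guard = RunawayGuard(max_steps=10, max_repeats=2, max_tokens=1_000)
guard.check("fetch", "id=1", tokens_used=100)
guard.check("fetch", "id=1", tokens_used=100)
# A third identical call would raise AgentLimitError("repeated call to fetch(id=1)").
```

The repeat counter catches the most common failure mode (the same failing call issued again and again) well before the step limit is reached, while the token budget caps spend even when the calls differ.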
An important correction to a widespread myth: the word "Reason" in ReAct does *not* mean the agent actually reasons correctly. The Reason step is a language output — the LLM generates text that looks like reasoning. It can be full of errors, and the model gives no signal to that effect. Unverifiable "thinking" is one of the most common causes of silent ReAct agent failures.
ReAct cost profile
A typical ReAct task consumes 2,000–3,000 tokens for 3–4 steps. At 8+ steps, consumption can reach 10,000–20,000 tokens per task, because the entire history is repeated in every context window. On a production workload you will notice this on your invoice.
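The growth can be made concrete with a back-of-the-envelope model. The sketch below assumes that every call resends a fixed base prompt plus everything accumulated so far, and it ignores output tokens. The figures `base_prompt=300` and `per_step=250` are illustrative assumptions, not measurements.

```python
def react_input_tokens(steps: int, base_prompt: int = 300, per_step: int = 250) -> int:
    """Total input tokens when every call resends the full history.

    Call k (1-based) sends base_prompt + (k - 1) * per_step tokens, so the
    total is steps * base_prompt + per_step * steps * (steps - 1) / 2.
    """
    return sum(base_prompt + (k - 1) * per_step for k in range(1, steps + 1))


for n in (4, 8, 12):
    print(n, react_input_tokens(n))
# 4 2700
# 8 9400
# 12 20100
```

Under these assumptions, tripling the step count from 4 to 12 multiplies the input tokens by more than seven, which is the quadratic term at work.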
Plan-and-Execute: plan first, then execute
Plan-and-Execute (P&E) splits the agent into two phases:
1. Planner: receives the goal and generates an explicit step-by-step plan (e.g. "step 1: fetch data from API, step 2: transform format, step 3: save to database, step 4: send report")
2. Executor: carries out the steps according to the plan, one by one, without re-planning
The key advantage: the planner and the executor do not have to be the same model. This is economically interesting — an expensive, powerful model (e.g. a Claude Opus-class or GPT-class model) handles planning, where quality matters. A cheap, fast model handles execution, where speed and cost matter. For long tasks with N steps this can be cheaper than pure ReAct.
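A minimal sketch of the two-phase split, assuming the planner and executor are plain callables that could wrap two different models. The names `plan_and_execute`, `stub_planner`, and `stub_executor` are hypothetical; the stubs return canned values instead of calling a model.

```python
from typing import Callable


def plan_and_execute(goal: str,
                     planner: Callable[[str], list],
                     executor: Callable[[str, list], str]):
    """Plan once, then run every step in order with no re-planning."""
    plan = planner(goal)                 # single call, e.g. to a stronger model
    results = []
    for step in plan:
        # The executor sees only the current step and earlier results,
        # never the planner's own prompt.
        results.append(executor(step, results))
    return plan, results


def stub_planner(goal: str) -> list:
    return ["fetch data from API", "transform format",
            "save to database", "send report"]


def stub_executor(step: str, previous: list) -> str:
    return f"done: {step}"


plan, results = plan_and_execute("monthly report", stub_planner, stub_executor)
print(results[0])   # done: fetch data from API
print(len(results)) # 4
```

Because the two roles are separate parameters, swapping in a cheaper model for execution is a one-line change at the call site.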
Where P&E works well
- Long, structured tasks (5+ steps with clear logic)
- Batch processing — when you need to process 1,000 documents with the same procedure
- Auditable pipelines — the plan is an explicit document; the client sees it before execution begins
- Resistance to prompt injection — research suggests P&E has some inherent advantage here: the executor does not have access to the planning instructions, making it harder for an attacker embedded in the data to redirect it (this is not a complete protection, however)
Where P&E fails
Basic P&E has a fixed plan. When conditions change mid-run — an API returns a different format, step 3 fails — the agent either continues according to the stale plan or gets stuck. Dynamic replanning (re-planning after a failure) is an extension, not a built-in property of P&E. You must implement it explicitly.
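Dynamic replanning can be added as a thin wrapper around the executor. The following is one possible sketch, not a standard algorithm: on failure it hands the planner the completed steps and the error, and it caps the number of replans. All names (`execute_with_replanning`, `stub_planner`, `stub_executor`) are hypothetical.

```python
def execute_with_replanning(goal, planner, executor, max_replans=2):
    """Run a plan; on a failed step ask the planner for a new remaining plan."""
    completed = []                                   # (step, result) pairs
    plan = list(planner(goal, completed, None))      # initial plan, no failure yet
    replans = 0
    while plan:
        step = plan[0]
        try:
            completed.append((step, executor(step)))
            plan = plan[1:]                          # step succeeded, move on
        except Exception as exc:
            if replans >= max_replans:
                raise RuntimeError(f"giving up after {replans} replans") from exc
            replans += 1
            # The planner sees what is already done and what went wrong.
            plan = list(planner(goal, completed, f"{step}: {exc}"))
    return completed


def stub_planner(goal, completed, failure):
    if failure is None:
        return ["fetch csv", "save report"]
    return ["fetch json", "save report"]             # revised plan after a failure


def stub_executor(step):
    if step == "fetch csv":
        raise ValueError("unexpected format")
    return f"ok: {step}"


done = execute_with_replanning("report", stub_planner, stub_executor)
print(done)
# [('fetch json', 'ok: fetch json'), ('save report', 'ok: save report')]
```

The `max_replans` cap matters for the same reason `max_steps` does in ReAct: a planner that keeps proposing the same failing step would otherwise loop indefinitely.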
Another myth to correct: P&E is *not always more accurate* than ReAct. For dynamic environments where conditions change mid-run, ReAct is more adaptable. P&E is better where the environment is predictable and the plan will remain valid through to completion.
P&E cost profile
A typical P&E task consumes 3,000–4,500 tokens (including the planning phase). That looks more expensive than ReAct at first glance. But for tasks with 8+ steps the total consumption is lower — the executor works with a shorter context than a ReAct loop that accumulates the entire history. Recommendation: for short tasks (2–4 steps) choose ReAct; for long tasks (6+ steps) consider P&E.
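The break-even point can be estimated with a toy model. Every constant below is an illustrative assumption chosen to sit in the same range as the figures above; real values depend on prompts, tools, and models, and the crossover moves with them.

```python
def react_tokens(steps: int, base: int = 300, per_step: int = 250) -> int:
    """ReAct: each call resends the base prompt plus the growing history."""
    return steps * base + per_step * steps * (steps - 1) // 2


def pe_tokens(steps: int, planning: int = 2500, per_step_context: int = 500) -> int:
    """P&E: one planning call, then a short fixed-size context per step."""
    return planning + steps * per_step_context


def crossover(max_steps: int = 30):
    """Smallest step count at which P&E is cheaper than ReAct, if any."""
    for n in range(1, max_steps + 1):
        if pe_tokens(n) < react_tokens(n):
            return n
    return None


print(react_tokens(4), pe_tokens(4))  # 2700 4500
print(react_tokens(8), pe_tokens(8))  # 9400 6500
print(crossover())                    # 6
```

With these assumed constants P&E becomes cheaper at 6 steps. The useful part is the shape of the two curves (quadratic against linear), not the exact number.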
Reflection and self-critique
Reflection (Reflexion, Shinn et al. 2023) adds a self-evaluation layer on top of ReAct or P&E:
1. The agent completes a task (ReAct or P&E)
2. A reflection module evaluates the result: "was the goal achieved? Where did errors occur?"
3. Based on the reflection, the agent's memory is updated or the step is retried with a revised approach
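These three steps can be sketched as a retry loop that carries written lessons from one trial to the next. This is a simplified illustration in the spirit of Reflexion, not the published algorithm; `run_with_reflection` and the three stub callables are hypothetical.

```python
def run_with_reflection(task, attempt, evaluate, reflect, max_trials=3):
    """Attempt a task, evaluate the output, and retry with accumulated lessons."""
    lessons = []                                   # the agent's reflection memory
    for trial in range(1, max_trials + 1):
        output = attempt(task, lessons)            # step 1: complete the task
        ok, feedback = evaluate(output)            # step 2: evaluate the result
        if ok:
            return output, trial
        lessons.append(reflect(output, feedback))  # step 3: update memory
    return None, max_trials                        # no acceptable output found


# Hypothetical stubs standing in for model calls.
def stub_attempt(task, lessons):
    return "report with totals" if lessons else "report"


def stub_evaluate(output):
    if "totals" in output:
        return True, ""
    return False, "totals section is missing"


def stub_reflect(output, feedback):
    return f"Previous attempt failed: {feedback}"


result, trials = run_with_reflection("write report", stub_attempt,
                                     stub_evaluate, stub_reflect)
print(result, trials)  # report with totals 2
```

Keeping `evaluate` as a separate parameter makes it easy to plug in a different model or a deterministic check instead of the model that produced the output.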
Example results: on CodeContests tasks, GPT-4 went from roughly 19% success with a simple prompt to around 44% with an iterative test-driven approach (the published AlphaCodium flow). That is a real gain — precisely for tasks with deterministic verification.
Where reflection helps
- Coding tasks with deterministic verification (code either passes tests or it does not)
- Iterative outputs — when a first draft is never sufficient and you want automatic improvement
- Complex analyses where you need a second look at the result
Limits of reflection
A critical limit: when the same model generates both the output and the critique, reflection tends to repeat the same errors. The model cannot see the gaps in its own reasoning. Replication studies from 2025 confirmed this — single-agent Reflexion in iterative cycles converges to the same (incorrect) answer. The solution is either a different model for verification, or an external deterministic verifier (tests, a linter, schema validation).
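An external deterministic verifier can be as simple as a table of test cases. The sketch below assumes the generated artefact is a Python callable; `verify`, `cases`, and the two candidate functions are hypothetical examples. In a real system, model-generated code should run inside a sandbox.

```python
def verify(candidate, cases):
    """Run a candidate against fixed test cases and return a list of failures."""
    failures = []
    for args, expected in cases:
        try:
            got = candidate(*args)
        except Exception as exc:
            failures.append(f"{args}: raised {type(exc).__name__}")
            continue
        if got != expected:
            failures.append(f"{args}: expected {expected!r}, got {got!r}")
    return failures


cases = [((2, 3), 5), ((-1, 1), 0)]


def buggy_add(a, b):
    return a - b        # wrong on purpose


def fixed_add(a, b):
    return a + b


print(verify(buggy_add, cases))
# ['(2, 3): expected 5, got -1', '(-1, 1): expected 0, got -2']
print(verify(fixed_add, cases))
# []
```

The verdict here does not depend on any model's opinion of its own output, and the failure messages can be fed back into the next attempt as concrete feedback.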
A separate article covers how guardrails for AI agents help catch these failures before they reach the user.
When a single LLM call is enough
Before choosing an agent architecture, ask: *do I actually need an agent?*
We estimate that 40–50% of use cases where clients come in requesting an agent can be handled with one or two LLM calls and no loop at all. Examples:
- Document classification (one call, structured output)
- Field extraction from an invoice (one call + JSON mode)
- Report generation from prepared data (one call, long prompt)
- Summarisation with citations (one call on top of a RAG retrieval step)
The rule: if you can fully define inputs, outputs, and the transformation in advance — you do not need an agent. An agent adds value when the steps are not known in advance or the environment is dynamic.
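For comparison, here is what the no-agent case can look like for invoice field extraction: one model call, followed by parsing and type checks. The `llm` parameter is a hypothetical callable that takes a prompt and returns a JSON string; `Invoice`, `PROMPT`, and `fake_llm` are likewise invented for this sketch.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Invoice:
    number: str
    total: float


PROMPT = ("Extract the invoice number and total as JSON "
          "with keys 'number' and 'total'.\n\n{text}")


def extract_invoice(text: str, llm: Callable[[str], str]) -> Invoice:
    """One model call, then deterministic parsing. No loop, no tools."""
    raw = llm(PROMPT.format(text=text))
    data = json.loads(raw)               # raises ValueError on malformed output
    return Invoice(number=str(data["number"]), total=float(data["total"]))


# Stand-in for a real model call, returning a fixed response.
def fake_llm(prompt: str) -> str:
    return '{"number": "INV-001", "total": 1250.5}'


invoice = extract_invoice("Invoice INV-001, total 1250.50 EUR", fake_llm)
print(invoice)  # Invoice(number='INV-001', total=1250.5)
```

Inputs, outputs, and the transformation are all fixed in advance, so there is nothing for an agent loop to decide.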
Related decision: chatbot vs copilot vs agent — where the precise boundaries between these concepts lie.
Decision framework
Rather than a table (which would oversimplify the trade-offs), here is a series of questions:
1. How many steps does the task have?
   - 1–3 steps → single LLM call or simple ReAct
   - 4–7 steps, predictable environment → Plan-and-Execute
   - 8+ steps with parallel logic → P&E + DAG executor or multi-agent
2. Does the environment change during execution?
   - No (batch processing, deterministic procedure) → P&E
   - Yes (interactive, real-time data, unpredictable APIs) → ReAct
3. Do you need an auditable output?
   - Yes, the client wants to see the plan before it runs → P&E
   - No, only the result matters → ReAct (shorter time to launch)
4. What is the cost of an error?
   - Low (internal tool, easily corrected) → ReAct with a max_steps limit
   - High (customer-facing, regulated domain) → P&E + reflection + human-in-the-loop
5. Do you have a dedicated team for maintenance?
   - No → Keep the pattern as simple as possible. ReAct with a clear max_steps and logging of every step.
   - Yes → You can afford P&E + dynamic replanning + reflection.
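The questions above can be encoded as a small helper for a first-pass recommendation. The order in which the answers take precedence is an assumption made for this sketch (the questions themselves do not rank each other), and `recommend_pattern` is a hypothetical name.

```python
def recommend_pattern(steps: int, dynamic_env: bool = False,
                      needs_audit: bool = False, high_error_cost: bool = False,
                      has_maintainers: bool = True) -> str:
    """Map the five answers to a starting pattern (assumed precedence)."""
    if high_error_cost:                      # question 4 takes priority here
        return "P&E + reflection + human-in-the-loop"
    if not has_maintainers:                  # question 5: keep it simple
        return "ReAct with max_steps and step logging"
    if steps <= 3:                           # question 1
        return "single LLM call or simple ReAct"
    if dynamic_env and not needs_audit:      # questions 2 and 3
        return "ReAct with max_steps"
    if steps >= 8:
        return "P&E + DAG executor or multi-agent"
    return "Plan-and-Execute"


print(recommend_pattern(2))                        # single LLM call or simple ReAct
print(recommend_pattern(5))                        # Plan-and-Execute
print(recommend_pattern(5, dynamic_env=True))      # ReAct with max_steps
print(recommend_pattern(10))                       # P&E + DAG executor or multi-agent
print(recommend_pattern(5, high_error_cost=True))  # P&E + reflection + human-in-the-loop
```

A function like this is a conversation starter rather than a verdict: a real decision weighs the answers against each other instead of applying them in a fixed order.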
Hybrid architectures in practice
In production we almost never see pure ReAct or pure P&E. Common combinations:
- P&E + ReAct executor — the planner generates steps; each step is executed by a mini-ReAct agent with tool access
- ReAct + reflection — after the ReAct loop completes, a reflection module assesses output quality
- P&E + dynamic replanning — when a step fails, the entire plan is reconsidered, not just that step
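As an illustration of the first combination, the sketch below runs each plan step through a small bounded act-observe loop. It is a simplified stand-in for a mini-ReAct executor: the `act` callable is a hypothetical placeholder for "model picks a tool call and we run it", and `run_step`, `run_hybrid`, and `stub_act` are invented names.

```python
def run_step(step, act, max_attempts=3):
    """Bounded act-observe loop for a single plan step."""
    observation = None
    for _ in range(max_attempts):
        # act() sees the previous observation, so it can adjust its approach.
        ok, observation = act(step, observation)
        if ok:
            return observation
    raise RuntimeError(f"step {step!r} failed after {max_attempts} attempts")


def run_hybrid(plan, act):
    """Execute a fixed plan, giving every step its own local retry loop."""
    return [run_step(step, act) for step in plan]


# Hypothetical action: the first fetch times out, the retry succeeds.
def stub_act(step, last_observation):
    if step == "fetch data" and last_observation is None:
        return False, "timeout"
    return True, f"{step}: ok"


results = run_hybrid(["fetch data", "send report"], stub_act)
print(results)  # ['fetch data: ok', 'send report: ok']
```

The plan stays fixed and auditable, while local adaptivity is confined to a single step with its own attempt limit.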
LangGraph is currently a widely used choice for implementing these hybrid patterns in Python. Alternatives such as AutoGen (Microsoft) or CrewAI are strong for multi-agent coordination, but in our view LangGraph offers the larger ecosystem and better production stability.
For a broader view of what an agent costs in production — including per-call pricing and optimisation approaches — see the dedicated article.
Debuggability — the underrated dimension
Architecture also affects how easily you can find a bug. From experience:
ReAct: Every Reason-Act-Observe step is visible in the trace log. When the agent fails at step 7, you can see exactly what Observe input it received and what Reason it generated. Well debuggable — provided you log every step correctly.
P&E: The plan is an explicit document — you can easily see where the executor deviated from the plan. But errors in the plan itself (faulty planner logic) are harder to detect, because the planner runs once and its decisions are not incremental.
Reflection: The hardest to debug. Iterative cycles produce a lot of logs, and when the model repeats the same mistake in every cycle it can be difficult to identify the root cause.
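Whatever the pattern, debuggability depends on structured per-step records. The sketch below writes one JSON line per phase and then scans the trace for the first failing observation. The field names, the `ERROR` prefix convention, and the functions `trace_event` and `first_error_step` are assumptions for this example; a list stands in for a real log sink.

```python
import json
import time


def trace_event(sink, run_id, step, phase, payload):
    """Append one structured trace record (as a JSON line) to the sink."""
    record = {"run_id": run_id, "step": step, "phase": phase,
              "ts": time.time(), "payload": payload}
    sink.append(json.dumps(record))
    return record


def first_error_step(lines):
    """Return the step number of the first observation marked as an error."""
    for line in lines:
        record = json.loads(line)
        if record["phase"] == "observe" and str(record["payload"]).startswith("ERROR"):
            return record["step"]
    return None


trace = []
trace_event(trace, "run-1", 1, "reason", "I should fetch the data first.")
trace_event(trace, "run-1", 1, "act", "fetch(id=1)")
trace_event(trace, "run-1", 1, "observe", "rows=10")
trace_event(trace, "run-1", 2, "act", "transform(rows)")
trace_event(trace, "run-1", 2, "observe", "ERROR: unexpected column type")

print(first_error_step(trace))  # 2
```

With `run_id`, `step`, and `phase` on every record, the same trace format works for ReAct steps, plan steps, and reflection cycles, which makes the three patterns comparable in one log.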
A separate article covers how to configure observability for AI agents (tracing, logging, and alerting) in a production deployment.
Frequently asked questions
What is the ReAct architecture and where did it come from?
ReAct (Reasoning + Acting) is a pattern for AI agents introduced by Yao et al. in a 2022 academic paper. The agent works in a Reason–Act–Observe loop: the LLM generates its reasoning, selects a tool, receives the result, and repeats. LangGraph, LangChain, and other frameworks implement this pattern but did not invent it.
When should I choose Plan-and-Execute over ReAct?
Plan-and-Execute is suited to long, structured tasks (6+ steps) in a predictable environment where you want an auditable plan before execution begins. ReAct is better for short, dynamic tasks where conditions change mid-run. For 1–3 steps, consider whether you need an agent at all.
Is Plan-and-Execute always more accurate than ReAct?
No. P&E has the advantage on structured, predictable tasks. In dynamic environments where conditions change mid-run, basic P&E without replanning fails: the agent continues according to an outdated plan. ReAct is more adaptable because it reacts to every Observe output.
Why is it dangerous to run a ReAct agent without a max_steps limit?
Without a configured step limit the agent can enter a loop (e.g. repeatedly calling a tool that returns an error, but the agent never treats this as a terminal state). An unconstrained agent like this can generate thousands of calls overnight and a bill in the thousands of euros. Every production agent must have an explicit max_steps or max_iterations parameter set.
Can I combine ReAct and reflection?
Yes, hybrid architectures are common in production. A typical combination: a ReAct loop for step execution, a reflection module after completion to assess output quality. Important caveat: reflection is reliable only when the output is evaluated by a different model or an external deterministic verifier (tests, schema validation) — not by the same model that generated the output.
*If you are considering deploying an AI agent in your organisation and are unsure which architecture suits your use case, we are happy to assess your specific situation. At MP Industrial Solutions we help clients choose an approach that matches their real requirements for reliability, cost, and team capacity — not the one that looks most impressive in a demo.*
