An AI agent running without constraints is like a racing driver on a test lap with no brakes. Most of the time it reaches the finish line just fine. But when something goes wrong — and it will — it goes wrong fast and expensively. In practice we have seen agents that looped overnight and racked up thousands of API calls and bills in the thousands of euros, agents that overwrote production data instead of test data, and agents that forwarded internal documents to an external endpoint while trying to "help". All of those incidents had one thing in common: no guardrails.
Guardrails are not an excuse or a blocker. They are operating rules — like circuit breakers in an electrical panel. When properly configured they do not slow production down. When they are missing, a single fault stops everything. This article breaks guardrails down into concrete layers and shows what to implement before you put an agent into production.
Why guardrails are not optional
An agent is not a chatbot. A chatbot responds — at worst it sends a poorly worded message. An agent acts: it calls APIs, writes to databases, sends emails, runs scripts, and reserves resources. Every action has a side effect in the world outside the LLM.
Most production incidents with agents did not happen because the model "hallucinated". They happened because the model correctly understood the task, but nobody constrained what it was allowed to do while carrying out that task. An attacker who knows about prompt injection (consistently ranked first in the OWASP LLM Top 10) can embed instructions inside external content that the agent reads and executes as legitimate commands. Without guardrails, the agent simply complies.
Guardrails are a multi-layer system — not a one-off output filter bolted to the end of the pipeline. Each layer catches a different class of failure.
Layer 1: Input validation
The first line of defence sits in front of the LLM itself. Before the agent receives a query, verify:
- Input length: define a maximum prompt length. Extremely long inputs can flood the context window, conceal injections, or slow processing.
- Sanitisation: strip or escape HTML, SQL special characters, and shell metacharacters. This applies equally to text the agent fetches from external sources (websites, emails, documents).
- Intent classification: a lightweight classifier (a small model or a regex ruleset is enough) detects whether an input looks like a manipulation attempt (prompt injection, jailbreak pattern). Tools such as Lakera Guard or Guardrails AI handle this out of the box, but even a hand-crafted ruleset of 20 patterns catches 80% of common attempts.
- PII detection: if the agent processes user-supplied input, check whether it contains sensitive data (national ID numbers, IBANs) that should not leave your system.
In practice: input validation is cheap (milliseconds, no LLM tokens consumed) and intercepts trivial attacks before they ever reach the model.
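The checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not a complete ruleset: the length limit, the three injection patterns, and the rough IBAN expression are all assumptions chosen for the example, and a production ruleset would be longer and tested against real traffic.

```python
import re
from dataclasses import dataclass, field

MAX_INPUT_CHARS = 8_000  # assumed limit; tune to your context window

# Hypothetical starter patterns; a real ruleset would be longer and tested.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"disregard (the |your )?system prompt", re.I),
    re.compile(r"you are now (in )?(developer|dan) mode", re.I),
]

# Rough IBAN shape: country code, two check digits, 11-30 alphanumerics.
IBAN_PATTERN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


@dataclass
class ValidationResult:
    allowed: bool
    reasons: list[str] = field(default_factory=list)


def validate_input(text: str) -> ValidationResult:
    """Run cheap, LLM-free checks and collect every reason for rejection."""
    reasons = []
    if len(text) > MAX_INPUT_CHARS:
        reasons.append("input_too_long")
    if any(p.search(text) for p in INJECTION_PATTERNS):
        reasons.append("possible_prompt_injection")
    if IBAN_PATTERN.search(text):
        reasons.append("contains_iban")
    return ValidationResult(allowed=not reasons, reasons=reasons)
```

Returning every reason, rather than stopping at the first, makes it easier to review later which rules fire together and which produce false positives.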
Layer 2: Action allowlist
This is the most important layer for tool-using agents. Explicitly define which actions the agent is permitted to perform — and block everything else by default.
Concretely: every tool the agent can call must appear on an allowlist. It is not enough for the tool to exist in code — it must be registered for the current context and user role. Example structure:
- `read_document(id)`: permitted for all users
- `search_web(query)`: permitted, but only for approved domains
- `send_email(to, body)`: permitted, but only to internal addresses ending in `@yourcompany.com`
- `write_database(table, data)`: permitted only for non-production tables
- `delete_record(id)`: blocked; requires human approval
The allowlist is not only a security mechanism — it is also documentation of what the agent actually does. When an agent behaves unexpectedly, the allowlist is the first place to audit.
Scope per session: do not grant permissions globally. Every agent session receives the minimum scope it needs (principle of least privilege). An agent that analyses delivery notes has no business accessing HR data.
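A default-deny allowlist with per-role and per-argument rules might look like the sketch below. The role names (`analyst`, `admin`) and the `prod_` table prefix are hypothetical conventions invented for the example; only the tool names and the internal e-mail domain come from the structure above.

```python
from dataclasses import dataclass
from typing import Callable

INTERNAL_DOMAIN = "@yourcompany.com"


def _no_constraint(args: dict) -> bool:
    """Default argument check: accept any arguments."""
    return True


@dataclass(frozen=True)
class ToolRule:
    roles: frozenset[str]                               # roles allowed to call the tool
    check_args: Callable[[dict], bool] = _no_constraint  # per-call constraint


# Hypothetical registry; role names and the "prod_" prefix are assumptions.
ALLOWLIST = {
    "read_document": ToolRule(roles=frozenset({"analyst", "admin"})),
    "send_email": ToolRule(
        roles=frozenset({"admin"}),
        check_args=lambda a: str(a.get("to", "")).endswith(INTERNAL_DOMAIN),
    ),
    "write_database": ToolRule(
        roles=frozenset({"admin"}),
        check_args=lambda a: not str(a.get("table", "")).startswith("prod_"),
    ),
    # delete_record is deliberately absent: anything unlisted is denied.
}


def is_allowed(tool: str, args: dict, role: str) -> bool:
    """Default-deny: the tool must be listed, the role permitted, the args valid."""
    rule = ALLOWLIST.get(tool)
    if rule is None:
        return False
    return role in rule.roles and rule.check_args(args)
```

The important design choice is the first branch of `is_allowed`: a tool that is missing from the registry is rejected, so adding a new tool to the codebase never silently widens what the agent can do.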
Layer 3: Runtime limits (max_steps, budget, timeout)
An agent looping without a ceiling is a recipe for an incident. Define hard limits:
- `max_steps`: the maximum number of steps (iterations) in a single run. Recommended range for production agents: 15–50 steps depending on complexity. On reaching the limit, the agent stops gracefully instead of crashing.
- Timeout: every agent run must have a timeout. If an agent has been running for 10 minutes on a task that should take 30 seconds, something has gone wrong. A timeout catches it before the bill does.
- Token / budget limit: assign each session a maximum token count or maximum API cost. For agents with variable run length this is a more effective ceiling than `max_steps` alone. A cap of $0.50–$5 per session is a sensible maximum for most use cases.
- Retry cap: tool calls fail. Retry logic is necessary, but without a ceiling (e.g. max 3 attempts per tool call) an agent can generate hundreds of calls to the same broken endpoint.
LangGraph (the de facto standard for production stateful agents) covers part of this natively through graph configuration: `recursion_limit` caps the number of steps in a run, and checkpointing and interrupts let a run be paused and resumed. Configuration takes minutes, not days.
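Independently of any framework, the four limits can be enforced around a plain agent loop. The sketch below is one possible shape, not a reference implementation: `step` is a hypothetical callback that performs one agent iteration and reports whether it is done and what it cost, and the default values are assumptions taken from the ranges above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunLimits:
    max_steps: int = 25          # within the 15-50 range suggested above
    timeout_s: float = 120.0     # assumed wall-clock ceiling
    max_cost_usd: float = 2.0    # within the $0.50-$5 range suggested above


@dataclass
class RunResult:
    steps: int
    cost_usd: float
    stop_reason: str             # "done", "max_steps", "timeout" or "budget"


def run_agent(step: Callable[[int], tuple[bool, float]],
              limits: RunLimits,
              clock: Callable[[], float] = time.monotonic) -> RunResult:
    """Call step(i) until it reports done or a limit trips.

    step is a hypothetical callback returning (done, cost_of_this_step_usd).
    """
    started = clock()
    cost = 0.0
    for i in range(limits.max_steps):
        if clock() - started > limits.timeout_s:
            return RunResult(i, cost, "timeout")
        if cost >= limits.max_cost_usd:
            return RunResult(i, cost, "budget")
        done, step_cost = step(i)
        cost += step_cost
        if done:
            return RunResult(i + 1, cost, "done")
    return RunResult(limits.max_steps, cost, "max_steps")


def call_with_retries(tool: Callable[[], object], max_attempts: int = 3) -> object:
    """Retry a failing tool call, but never more than max_attempts times."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return tool()
        except Exception as exc:  # sketch only; catch narrower errors in practice
            last_error = exc
    raise RuntimeError(f"tool failed after {max_attempts} attempts") from last_error
```

Every exit path returns a `RunResult` with an explicit `stop_reason`, which is what turns a limit into a graceful stop: the caller can log it, alert on it, or hand the partial result to a human. Note that the timeout is only checked between steps, so a single step that hangs still needs its own timeout at the tool-call level.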
Layer 4: Sandbox and isolation
If an agent runs code, executes shell commands, or works with the filesystem, isolation is a requirement — not an optional feature.
- Containerisation — code the agent generates and executes must run inside an isolated container (Docker, gVisor) with restricted permissions. Not on the host system.
- Read-only filesystem — mount production data as read-only. The agent can read; it cannot overwrite.
- Network isolation — if the agent does not need internet access, block it. If it does, define an allowlist of domains (not open internet).
- Ephemeral environment — create a clean sandbox for every run and destroy it afterwards. No shared state between sessions.
For agents that execute code in industrial environments (PLC scripting, network device configuration) this layer is critical — a mistake can have a physical impact.
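As an illustration of the four points above, the following sketch builds a `docker run` command for a single throwaway container. The image name, resource limits, user ID, and mount paths are assumptions chosen for the example; the flags themselves are standard Docker CLI options. This variant blocks networking entirely and does not show a domain allowlist, which would need an egress proxy or firewall rules in addition.

```python
import subprocess
from pathlib import Path


def build_sandbox_command(script: Path, data_dir: Path,
                          image: str = "python:3.12-slim") -> list[str]:
    """Build a `docker run` invocation for one throwaway, locked-down container."""
    return [
        "docker", "run",
        "--rm",                      # ephemeral: container is removed on exit
        "--network", "none",         # no network access at all
        "--read-only",               # container root filesystem is read-only
        "--tmpfs", "/tmp",           # scratch space that vanishes with the container
        "--cap-drop", "ALL",         # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--memory", "256m", "--cpus", "0.5", "--pids-limit", "64",
        "--user", "65534:65534",     # run as an unprivileged user, not root
        "-v", f"{data_dir.resolve()}:/data:ro",            # data mounted read-only
        "-v", f"{script.resolve()}:/sandbox/script.py:ro",  # generated code, read-only
        image, "python", "/sandbox/script.py",
    ]


def run_in_sandbox(script: Path, data_dir: Path, timeout_s: float = 30.0) -> str:
    """Run the script in the sandbox and return its stdout (requires Docker)."""
    completed = subprocess.run(
        build_sandbox_command(script, data_dir),
        capture_output=True, text=True, timeout=timeout_s, check=True,
    )
    return completed.stdout
```

Keeping command construction separate from execution means the security-relevant flags can be unit-tested without a Docker daemon, so a refactor cannot quietly drop `--read-only` or `--network none`.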
Layer 5: Human-in-the-loop (HITL) and approval gates
Not every agent action requires human oversight — that would defeat the purpose of automation. But some actions do, and you must identify them explicitly.
Typical high-risk actions requiring approval:
- Financial transactions above a defined threshold
- Sending external communications (emails to customers, messages to partners)
- Changes to production databases or configurations
- Actions with irreversible consequences (deletion, deployment)
LangGraph implements HITL via interrupt() — the agent pauses, waits for human input (approve / reject / correct), then continues or exits based on the response. This is not "slowing things down" — it is a kill-switch for critical actions.
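The pause, wait, and continue pattern can be illustrated without any framework. The sketch below is plain Python and does not use LangGraph's API: `ApprovalGate`, `HIGH_RISK_TOOLS`, and the tool names are hypothetical, and the pending actions are held in memory where a real system would persist them.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical set of tool names that always need a human decision.
HIGH_RISK_TOOLS = {"send_external_email", "deploy", "delete_record"}


@dataclass
class ApprovalGate:
    execute: Callable[[str, dict], object]            # performs the real action
    pending: dict[str, tuple[str, dict]] = field(default_factory=dict)

    def submit(self, tool: str, args: dict) -> dict:
        """Run low-risk actions at once; park high-risk ones for review."""
        if tool not in HIGH_RISK_TOOLS:
            return {"status": "executed", "result": self.execute(tool, args)}
        ticket = uuid.uuid4().hex
        self.pending[ticket] = (tool, args)
        return {"status": "pending_approval", "ticket": ticket}

    def decide(self, ticket: str, approved: bool) -> dict:
        """Apply the reviewer's decision; each ticket can be used only once."""
        tool, args = self.pending.pop(ticket)
        if not approved:
            return {"status": "rejected"}
        return {"status": "executed", "result": self.execute(tool, args)}
```

The gate stores the exact tool name and arguments at submission time and executes those on approval, so the reviewer approves what will actually run rather than a later, possibly altered, request.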
For routine steps use a Human-on-the-loop approach: the agent runs autonomously, but every step is logged and available for real-time review via an observability platform. A human intervenes only when they receive an alert or want to audit a run.
The EU AI Act (Art. 14) makes human oversight mandatory for high-risk AI systems from 2 August 2026 — HITL implementation is not only good practice, it is a legal obligation for the relevant category of systems.
Layer 6: Output validation
The output filter is the last safety net before the agent's result leaves the system or triggers the next action.
- Schema validation — if the agent produces structured output (JSON, XML, parameters for a downstream system), validate it against a schema before using it. Malformed JSON can crash a downstream system.
- Content filter — detect potentially harmful content in the response (PII, sensitive company information, unwanted patterns).
- Faithfulness check — if the agent works with documents (RAG), verify that the answer is grounded in the source material rather than generated from thin air. Simple heuristics (citation presence, confidence score) or LLM-as-judge catch the worst hallucinations before the result is delivered.
- Format and length — if the output feeds a downstream system, verify that it does not exceed the system's limits (max tokens, max file size).
Tools such as NeMo Guardrails (NVIDIA) or Guardrails AI let you define these validations declaratively — you add a YAML rule definition rather than additional code in the pipeline.
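For cases where a dedicated tool is not yet in place, schema, format, and length checks can be hand-rolled with the standard library. The sketch below assumes a hypothetical order payload (`ORDER_SCHEMA`) and an assumed downstream size limit; libraries such as `jsonschema` or Pydantic cover the same ground more thoroughly.

```python
import json

# Hypothetical schema for an order produced by the agent: field -> type.
ORDER_SCHEMA = {"order_id": str, "quantity": int, "unit_price": float}
MAX_OUTPUT_CHARS = 2_000  # assumed limit of the downstream system


def validate_output(raw: str) -> tuple[bool, list[str]]:
    """Check size, JSON syntax, required fields and field types."""
    if len(raw) > MAX_OUTPUT_CHARS:
        return False, ["output_too_long"]
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, ["invalid_json"]
    if not isinstance(data, dict):
        return False, ["not_an_object"]

    errors = []
    for name, expected in ORDER_SCHEMA.items():
        if name not in data:
            errors.append(f"missing:{name}")
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            # bool is a subclass of int in Python, so reject it explicitly
            errors.append(f"wrong_type:{name}")
    # Unknown fields are rejected too, so the agent cannot smuggle extra data through.
    extra = sorted(set(data) - set(ORDER_SCHEMA))
    errors.extend(f"unexpected:{name}" for name in extra)
    return not errors, errors
```

The type check is deliberately strict: a `unit_price` emitted as the integer `9` is rejected because the schema asks for a float. Whether to coerce or reject in such cases is a design decision that depends on the downstream system.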
Kill-switch and emergency stop
Even when all layers are correctly configured, you need a mechanism to stop an agent immediately. A kill-switch is not paranoia — it is a failsafe you will (if everything goes well) never have to use.
Minimum kill-switch implementation:
- Per-session kill — every agent run has a unique ID; an operator can stop the session via API or dashboard
- Global pause — a toggle that stops all agents simultaneously (forced update, security incident)
- Circuit breaker — automatic kill if the error rate exceeds a threshold (e.g. 20% failed tool calls within 5 minutes → agent stops and waits for manual restart)
- Resource monitor — if the agent consumes more than X tokens or Y euros during a run, automatic stop plus alert
The kill-switch must work even when the agent is stuck in a loop — which means it must not be implemented purely as a step inside the agent loop. It must be an external mechanism (a watchdog process, Kubernetes job termination, systemd unit stop).
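The circuit-breaker item from the list above can be sketched as a sliding-window failure counter. The 20% threshold and 5-minute window follow the example given earlier; the `min_calls` floor is an added assumption so that a single early failure does not trip the breaker. A breaker like this runs inside the agent process, so it complements the external watchdog rather than replacing it.

```python
import time
from collections import deque
from typing import Callable


class CircuitBreaker:
    """Trips when the failure rate inside a sliding time window gets too high."""

    def __init__(self, max_failure_rate: float = 0.2, window_s: float = 300.0,
                 min_calls: int = 10, clock: Callable[[], float] = time.monotonic):
        self.max_failure_rate = max_failure_rate
        self.window_s = window_s
        self.min_calls = min_calls   # assumed floor before the rate is evaluated
        self.clock = clock
        self.calls: deque[tuple[float, bool]] = deque()  # (timestamp, succeeded)
        self.open = False            # open means the agent must stop

    def record(self, succeeded: bool) -> None:
        """Register one tool-call outcome and trip the breaker if needed."""
        now = self.clock()
        self.calls.append((now, succeeded))
        # Drop calls that have fallen out of the window.
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, ok in self.calls if not ok)
        enough_data = len(self.calls) >= self.min_calls
        if enough_data and failures / len(self.calls) > self.max_failure_rate:
            self.open = True         # stays open until reset() is called

    def reset(self) -> None:
        """Manual restart by an operator."""
        self.calls.clear()
        self.open = False
```

In use, the agent loop would check `breaker.open` before each tool call and stop if it is set. The breaker never closes on its own, which matches the requirement that the agent waits for a manual restart.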
How to layer defences in practice
Guardrails are not a single module — they are a combination of layers that complement one another. No individual layer is one hundred percent effective; their combination dramatically reduces risk:
1. Input validation → intercepts injections and nonsense before the LLM
2. Action allowlist → limits what the agent can do
3. Runtime limits → prevents uncontrolled cost escalation and infinite loops
4. Sandbox → isolates side effects
5. HITL approval → stops high-risk actions before execution
6. Output validation → catches the last remaining issues before impact
7. Kill-switch → allows everything to be halted in an emergency
In a real deployment you do not implement everything at once. Prioritise based on the risk profile of the use case: an agent doing internal document search primarily needs input validation and an output filter. An agent that places orders or sends contracts needs the full stack.
Before every production deployment we recommend red-teaming: a systematic attempt to break the agent by your own team. Guardrails that nobody has tested are guardrails on paper only.
Guardrails and observability — two sides of the same coin
Guardrails protect you from an incident. Observability tells you why one nearly happened — and what to fix. Without tracing and logging at the level of each individual agent step you have no data on which to improve your guardrails over time.
Platforms such as LangSmith, Langfuse (self-hostable on Postgres + ClickHouse), or Arize Phoenix capture a node-by-node state diff for every agent run. When guardrails fire, you can see exactly which input triggered them — and decide whether the intervention was correct or a false positive.
This is an iterative cycle: guardrails collect incidents → observability analyses root causes → guardrails are refined. Without this cycle, guardrails stagnate and, as the agent's scope grows, they stop covering new attack vectors.
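To feed that cycle, each guardrail intervention needs to leave a structured record. The sketch below uses only the standard library, with an in-memory list standing in for a tracing backend; the field names and the `record_guardrail_event` and `top_rules` helpers are hypothetical and not tied to any of the platforms named above.

```python
import json
import logging
from collections import Counter
from datetime import datetime, timezone

logger = logging.getLogger("guardrails")
events: list[dict] = []   # in-memory stand-in for a tracing backend


def record_guardrail_event(run_id: str, step: int, layer: str,
                           rule: str, action: str) -> dict:
    """Emit one structured record each time a guardrail fires."""
    event = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "step": step,
        "layer": layer,      # e.g. "input_validation", "allowlist"
        "rule": rule,        # which rule fired
        "action": action,    # "blocked", "flagged" or "escalated"
    }
    events.append(event)
    logger.warning(json.dumps(event))
    return event


def top_rules(n: int = 5) -> list[tuple[str, int]]:
    """Most frequently firing rules: first candidates for a false-positive review."""
    return Counter(e["rule"] for e in events).most_common(n)
```

A rule that fires far more often than the others is either catching a real campaign or producing false positives; in both cases it is the first one worth reviewing.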
Frequently asked questions
Isn't a single output filter at the end of the pipeline enough?
No. An output filter catches bad outputs — but by that point the agent may have already performed dozens of actions with real-world side effects (API calls, database writes, sent emails). Guardrails must block actions before they are executed, not merely filter text after the fact.
Is prompt injection a real threat or just an academic problem?
A real threat in production systems. The OWASP Top 10 for LLM Applications lists prompt injection as its first entry (LLM01). The first documented zero-click attack via poisoned external content (EchoLeak, Aim Security, 2025) showed that an attacker does not even need to interact directly with the agent: planting instructions in a document the agent will read is enough. Every agent that processes external content (websites, emails, third-party files) is a potential target.
LangGraph interrupt() — how does it work technically?
interrupt() is a mechanism that pauses graph execution at a defined point, persists the entire state (checkpointing), and waits for external input. The input can arrive via an API call, a webhook, or a UI. Once the input is received, the graph resumes from the point of interruption — not from the beginning. This way the agent "remembers" the context from before the approval and continues seamlessly afterwards.
How do you decide which actions need human approval?
A simple framework: an action needs approval if it (1) is irreversible (deletion, deployment, sending), (2) has a financial impact above a defined threshold, (3) involves an external party (customer, partner, regulator), or (4) modifies production systems. Internal reads, searches, and draft generation generally do not require approval.
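The four criteria translate directly into a small decision function. In this sketch `ActionProfile` is a hypothetical description that each tool call would carry, and the 500-euro threshold is an assumed value for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionProfile:
    irreversible: bool = False
    cost_eur: float = 0.0
    external_party: bool = False
    touches_production: bool = False


APPROVAL_THRESHOLD_EUR = 500.0   # assumed value; set per use case


def approval_reasons(action: ActionProfile) -> list[str]:
    """Return every criterion the action meets; an empty list means no approval needed."""
    reasons = []
    if action.irreversible:
        reasons.append("irreversible")
    if action.cost_eur > APPROVAL_THRESHOLD_EUR:
        reasons.append("financial_impact")
    if action.external_party:
        reasons.append("external_party")
    if action.touches_production:
        reasons.append("modifies_production")
    return reasons
```

Returning the reasons instead of a bare boolean gives the reviewer context: an approval request that says "irreversible, external_party" is quicker to judge than one that only says "needs approval".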
How do guardrails affect agent latency?
Input validation and the allowlist are practically free — milliseconds with no LLM involved. An HITL approval gate adds latency proportional to how quickly a human reviewer responds — seconds to minutes, but only for high-risk actions. Output validation (schema checks, heuristics) is fast as well. The only potentially slower layer is LLM-as-judge for faithfulness checking — but even here a cheap model (Haiku tier) is sufficient for the verification; a frontier model is not required.
Conclusion
*Guardrails are not a brake on AI agents — they are the conditions under which you can actually trust agents in a production environment. Without them every agent is an experiment, not a product. If you are considering deploying AI agents in your organisation or want to verify that your existing guardrails are adequate, we are happy to assess your specific use case and propose a layered defence matched to the real level of risk.*
