Human-in-the-Loop: When to Let an AI Agent Act on Its Own


One of our clients — a mid-sized manufacturing firm running an ERP — deployed an AI agent to handle purchase orders. The first three weeks went smoothly: it grouped requests, verified availability, and generated proposals. Then came a case that wasn't in the training data — a customer with a non-standard payment term. The agent estimated the situation was routine and confirmed the order. Without approval. Without escalation. With the wrong terms.

It never occurred to the agent to stop and ask. It had no rules for that.

This isn't an argument against AI agents. It's an argument for deliberate human-in-the-loop (HITL) design — a set of rules that determine when an agent acts on its own and when it waits for a human. This article is a practical decision framework: where to draw the line, how to technically implement an approval flow, and how to expand agent autonomy gradually and safely.

Why HITL is not just a "safety patch"

The most common mistake in agent design: HITL gets bolted on at the end as a filter. The agent's output — a decision, an action, a document — passes through an approval window, and if a human clicks "OK," it proceeds.

That approach only treats the symptom. You can't see what the agent did before it reached the approval window. You don't know whether the decision flowed logically from the right steps or whether the agent reached the right answer by the wrong route. And when something goes wrong, the audit trail is empty.

Proper HITL is an architectural pattern, not an add-on. You define it before deployment, not after the first incident. It includes:

Action classification by level of risk and irreversibility
Explicit interrupt points in the agent graph — not just at the output, but directly in the decision flow
Audit trail for every step: what the agent saw, what it decided, who approved

EU AI Act Article 14 requires human oversight for high-risk AI systems from 2 August 2026. If your system falls into this category — automated decision-making in employment matters, credit scoring, management of critical infrastructure — HITL is not a recommendation. It is a legal obligation.

Actions that always require approval

Not every agent action carries the same risk. The key criterion is irreversibility: can the action be undone? If not, approval is mandatory.

The second criterion is scope of impact: who does the action affect, how much money or data is at stake, what is the regulatory exposure?

Actions where we always recommend stopping and requesting approval:

Financial transactions: sending a payment, issuing an invoice, changing a price list, placing a supplier order above a defined limit
Irreversible data operations: deleting records, overwriting without a backup, exporting to an external system
External communications on behalf of the company: email to a customer, response to a complaint, legal document
Changes in third-party systems: writes to ERP, CRM, SCADA; changes to production environment configurations
Escalation decisions with legal exposure: rejecting a claim, terminating a contract, accepting a liability

In practice, we see companies underestimate these categories at the start. While the agent is only working in an internal sandbox, everything looks safe. Problems arrive the moment the agent gains access to production tools.

Human-on-the-loop vs. Human-in-the-loop

An important distinction that will shape the entire design:

Human-in-the-loop (HITL) — the agent stops and *waits* for approval before executing an action. The flow is blocked. Response time depends on operator availability. Suitable for irreversible or high-risk actions.

Human-on-the-loop (HOTL) — the agent acts autonomously but logs every step, and an operator can intervene asynchronously. The flow is non-blocking. Suitable for actions that are easily reversible or have low impact.

Most production systems need both modes. A simple decision tree:

1. Is the action irreversible? → HITL mandatory
2. Is the financial impact above a threshold (e.g. 1,000 EUR)? → HITL mandatory
3. Is there regulatory exposure (GDPR, contract, liability)? → HITL mandatory
4. Is the action routine and reversible? → HOTL is sufficient
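
The decision tree above can be sketched as a small routing function. This is only an illustration: the `ProposedAction` fields and the `oversight_mode` name are hypothetical, and the fallback for actions the tree does not cover (reversible but not routine) is an assumption that fails closed to HITL.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    HITL = "human-in-the-loop"  # block and wait for approval
    HOTL = "human-on-the-loop"  # act, log, allow later intervention


@dataclass(frozen=True)
class ProposedAction:
    # Hypothetical fields; map them onto your own action model.
    name: str
    reversible: bool
    financial_impact_eur: float
    regulatory_exposure: bool
    routine: bool


def oversight_mode(action: ProposedAction, threshold_eur: float = 1000.0) -> Mode:
    """Walk the four questions in order; the first match decides."""
    if not action.reversible:
        return Mode.HITL
    if action.financial_impact_eur > threshold_eur:
        return Mode.HITL
    if action.regulatory_exposure:
        return Mode.HITL
    if action.routine:
        return Mode.HOTL
    # Assumption: anything the tree does not cover fails closed.
    return Mode.HITL


refund = ProposedAction("issue_refund", True, 2500.0, False, True)
tagging = ProposedAction("tag_ticket", True, 0.0, False, True)
print(oversight_mode(refund).name, oversight_mode(tagging).name)  # HITL HOTL
```

Keeping the threshold as a parameter rather than a constant makes it easy to revisit as the agent builds a track record.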

This is not a one-time decision. Thresholds change as the agent builds a history and earns trust.

Technical implementation: LangGraph interrupt()

The cleanest technical implementation of HITL for stateful agents is via LangGraph and its interrupt() mechanism. The principle is straightforward: in the agent graph you define nodes where the flow pauses and waits for external input — approval, correction, rejection.

The basic pattern:

The agent processes the input and decides what action it wants to take
Before executing the action it reaches an interrupt node: it saves the current state (checkpointing), generates a description of the proposed action for the operator, and waits
The operator receives a notification (email, Slack, integrated UI panel)
On approval the flow resumes from the exact checkpoint — the agent remembers the full context, not just the last message
On rejection or modification the agent receives feedback and can replan

The key feature of LangGraph checkpointing: the agent's state is serialised through a checkpointer, which must be backed by persistent storage (a database rather than process memory) for the state to survive a restart. Approval can arrive an hour later or the next day, and the agent "wakes up" with full context. This is a fundamental difference from the naive approach, where you would have to implement the interrupt yourself on top of a database and Redis.
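
To make the pause-and-resume idea concrete, here is a library-free sketch of the principle. It deliberately does not use LangGraph's API: `pause_for_approval`, `resume`, and the state layout are hypothetical names chosen for illustration, and a JSON file stands in for whatever persistent store a real checkpointer would use.

```python
import json
import tempfile
from pathlib import Path


def pause_for_approval(state: dict, checkpoint: Path) -> None:
    """Persist the full agent state so the process can exit while waiting."""
    checkpoint.write_text(json.dumps(state))


def resume(checkpoint: Path, decision: str) -> dict:
    """Reload the saved state and continue from the paused step."""
    state = json.loads(checkpoint.read_text())
    state["decision"] = decision
    # On approval the action runs; otherwise the agent goes back to planning.
    state["status"] = "executed" if decision == "approve" else "replanning"
    return state


checkpoint_file = Path(tempfile.mkdtemp()) / "order-42.json"
pause_for_approval(
    {
        "step": "confirm_order",
        "history": ["grouped requests", "verified availability"],
        "proposed_action": {"tool": "confirm_order", "order_id": 42},
    },
    checkpoint_file,
)

# Hours later, possibly in a different process:
resumed = resume(checkpoint_file, "approve")
print(resumed["status"], len(resumed["history"]))
```

The point of the sketch is that the whole history survives the wait, not just the last message. A framework-level checkpointer takes care of this bookkeeping for every node in the graph.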

For a trustworthy audit trail, every interrupt node records: agent identifier, timestamp, proposed action, tool arguments, and later also the operator's decision and reasoning.
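
One possible shape for such a record is sketched below. The class and field names are hypothetical; `operator_id` and `decided_at` are additions assumed here so that the trail also shows who approved and when.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class AuditRecord:
    agent_id: str
    proposed_action: str
    tool_arguments: dict
    timestamp: str = field(default_factory=_utc_now)
    # Filled in later, once the operator has responded.
    operator_id: Optional[str] = None
    operator_decision: Optional[str] = None
    operator_reasoning: Optional[str] = None
    decided_at: Optional[str] = None

    def record_decision(self, operator_id: str, decision: str, reasoning: str) -> None:
        self.operator_id = operator_id
        self.operator_decision = decision
        self.operator_reasoning = reasoning
        self.decided_at = _utc_now()

    def to_json_line(self) -> str:
        """One JSON object per line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = AuditRecord("po-agent-1", "confirm_order", {"order_id": 42, "payment_term_days": 90})
record.record_decision("operator-7", "reject", "Payment term is non-standard.")
line = record.to_json_line()
print(line)
```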

More on what agentic architectures look like in practice is covered in the article AI Agent Architectures: ReAct, Plan-and-Execute — When to Use Which.

Designing the approval flow

A good approval flow has three properties: context-richness (the operator sees *why* the agent is proposing this action, not just *what*), response speed (the longer the agent waits, the more blocked tasks pile up), and escalation (if the operator doesn't respond within a defined time, the task escalates or safely terminates).

Recommended notification structure for the operator:

Context: What informs the decision — the original request, relevant data, steps preceding the action
Proposed action: Exactly what the agent wants to do, including arguments (e.g. "Order 500 units of SKU-4421 from supplier X at price Y")
Reason: Why the agent reached this decision
Alternatives: If the agent considered multiple options, show them (helps the operator when correcting)
Actions: Approve / Reject / Edit with comment / Escalate
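
The notification structure above could be modelled roughly as follows. This is a sketch under assumptions: `ApprovalRequest`, `render`, and the rule that context and reason must not be empty are illustrative choices, not a prescribed format.

```python
from dataclasses import dataclass, field

RESPONSES = ("approve", "reject", "edit", "escalate")


@dataclass
class ApprovalRequest:
    context: str
    proposed_action: str
    reason: str
    alternatives: list = field(default_factory=list)

    def render(self) -> str:
        """Build the operator-facing message; refuse to send a bare 'Confirm?'."""
        if not (self.context.strip() and self.reason.strip()):
            raise ValueError("context and reason are required")
        lines = [
            f"Context: {self.context}",
            f"Proposed action: {self.proposed_action}",
            f"Reason: {self.reason}",
        ]
        if self.alternatives:
            lines.append("Alternatives: " + "; ".join(self.alternatives))
        lines.append("Actions: " + " / ".join(RESPONSES))
        return "\n".join(lines)


request = ApprovalRequest(
    context="Stock of SKU-4421 is below the reorder point",
    proposed_action="Order 500 units of SKU-4421 from supplier X at price Y",
    reason="Supplier X has the shortest lead time",
    alternatives=["Order 250 units now and 250 next month"],
)
message = request.render()
print(message)
```

Rejecting requests without context or reason at construction time keeps the reflexive "Confirm action?" prompt from ever reaching the operator.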

This is not just a UX detail. An operator who sees only "Confirm action?" approves mostly by reflex. An operator who sees the full context makes an informed decision — and simultaneously builds intuition about where the agent makes mistakes.

We cover security and guardrails for agents in more depth in the article Guardrails for AI Agents.

Trust vs. speed: the core trade-off

HITL has a cost. Every interrupt point adds latency — waiting for a human response can take minutes or hours. If an agent has ten steps and each one waits for approval, the system is not an agent. It is a form-based workflow with an AI façade.

That is why trust in an agent is a variable you scale up gradually, not a binary decision.

A practical framework for gradually expanding autonomy:

Phase 1 — Shadow mode: The agent proposes actions, but a human always executes them. Data is collected: how often was the agent correct, where did it err, what was the context of errors. Duration: 2–4 weeks or 100–200 actions.

Phase 2 — Selective autonomy: Based on data from Phase 1, you define action categories where the agent had > 95% accuracy and release those to HOTL mode. You still monitor. High-risk categories remain HITL.

Phase 3 — Bounded autonomy with limits: Autonomy in low-risk categories with explicit limits (e.g. orders up to 500 EUR without approval). Every limit has a justification and a review date.

Phase 4 — Adaptive autonomy: The system monitors performance in real time. If the error rate rises above a threshold, it automatically reverts to a more restrictive mode. If it stays low, the limit can be extended.
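
The automatic fallback described in Phase 4 might look like the sketch below. `AutonomyGate`, the window size, and the minimum sample count are assumptions for illustration; the 5% error threshold mirrors the 95% accuracy bar from Phase 2.

```python
from collections import deque


class AutonomyGate:
    """Tracks recent outcomes for one action category and picks the mode."""

    def __init__(self, window: int = 100, max_error_rate: float = 0.05, min_samples: int = 20):
        self.outcomes = deque(maxlen=window)  # rolling window of True/False
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples

    def record(self, correct: bool) -> None:
        self.outcomes.append(bool(correct))

    def error_rate(self) -> float:
        if not self.outcomes:
            return 1.0  # no history yet: treat as untrusted
        return self.outcomes.count(False) / len(self.outcomes)

    def mode(self) -> str:
        # Too little data, or too many recent errors: stay restrictive.
        if len(self.outcomes) < self.min_samples:
            return "HITL"
        return "HOTL" if self.error_rate() < self.max_error_rate else "HITL"


gate = AutonomyGate()
for _ in range(30):
    gate.record(True)
mode_after_clean_run = gate.mode()

gate.record(False)
gate.record(False)
mode_after_errors = gate.mode()
print(mode_after_clean_run, mode_after_errors)  # HOTL HITL
```

In this sketch the gate only reverts; extending a limit after a period of low error rates would remain a reviewed, human decision.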

This is not a one-time project. It is a continuous process of calibrating trust — much like onboarding a new employee.

We address the question of costs and return on investment from an agent perspective in AI Agent Costs in Production.

What HITL doesn't solve: limits of the approach

HITL is necessary, but not sufficient. Several things that an approval flow alone will not fix:

Prompt injection: an attacker embeds instructions into the agent's input (for example, inside a document being processed), and the agent executes them without this being evident in the approval window. The operator sees a "normal" action that is in fact the product of a manipulated flow. The OWASP Top 10 for LLM Applications lists prompt injection as its first entry (LLM01).

Overly broad tool scope: if the agent has access to too wide a set of tools, HITL protects only the actions that are explicitly flagged. Anything outside the monitored scope can fly under the radar. Solution: the principle of least privilege for each tool.

Approval speed as a bottleneck: if a single operator is approving dozens of requests per day, the quality of their decisions declines. Routine requests get approved reflexively; real anomalies pass unnoticed. HITL must be complemented by triage, because not every approval carries the same weight.

Agent drift: agent performance changes over time as data or the prompting context shifts. An action that was safely autonomous six months ago may be risky today. That is why review dates for limits are a requirement, not a recommendation.

Agent observability — tracking traces, error rates, and latencies in real time — is a prerequisite for functional HITL. Without it you have no idea when drift is occurring. We write about tools for monitoring agents in AI Agent Observability.

Practical recommendations for deployment

From experience across dozens of projects:

1. Start with full HITL on all actions. Not out of caution, but to collect data. Without a history you cannot make informed decisions about where to release autonomy.
2. Categorise actions explicitly. Have the list of categories with defined risk and limits approved by the process owner, not just the IT team.
3. Log everything, not just incidents. Routine approvals are just as important for trend analysis as rejections.
4. Set up automatic escalation. If the operator doesn't respond within X minutes, the task escalates to a deputy or safely terminates with a notification. A waiting agent must never block a production process.
5. Test bad inputs. Simulate inputs where the agent might want to take a problematic action. Verify that the interrupt fires correctly.
6. Review cycle every 3 months. Go through approval statistics. Where error rates are low and the track record is solid, consider expanding autonomy. Where surprises emerge, tighten.
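
The automatic escalation from recommendation 4 can be sketched as a pure function over timestamps. The function name, the returned labels, and the 15-minute defaults are assumptions; the article leaves the actual timeout ("X minutes") to each deployment.

```python
from datetime import datetime, timedelta, timezone


def escalation_step(
    requested_at: datetime,
    now: datetime,
    primary_timeout: timedelta = timedelta(minutes=15),
    deputy_timeout: timedelta = timedelta(minutes=15),
) -> str:
    """Decide what to do with a request that is still waiting for a response."""
    waited = now - requested_at
    if waited < primary_timeout:
        return "wait_for_operator"
    if waited < primary_timeout + deputy_timeout:
        return "escalate_to_deputy"
    # Nobody answered in time: stop safely instead of blocking production.
    return "terminate_and_notify"


t0 = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
for minutes in (5, 20, 45):
    print(minutes, escalation_step(t0, t0 + timedelta(minutes=minutes)))
```

Because the function takes `now` as an argument instead of reading the clock, the timeout paths are easy to cover in tests alongside the bad-input simulations from recommendation 5.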

Frequently asked questions

Do we need HITL for an internal chatbot that only answers questions?

If the agent only reads and responds — no tools with write access, no outward-facing actions — HITL is not necessary. HOTL is enough: you monitor outputs, set content guardrails, and have the ability to intervene. HITL is relevant where an agent *acts*, not just *answers*.

How do we stop operators from approving without thinking?

Two proven techniques: first, display the full context (why the agent is proposing the action), not just the action itself. Second, for critical categories require the operator to provide a brief justification for the approval — at least one sentence. Mandatory justification dramatically reduces reflexive approvals.

Can HITL make an agent too slow for production use?

Yes, if poorly designed. The key is selectivity: HITL only for genuinely high-risk actions, everything else on HOTL. In practice, the category of truly critical actions makes up 5–15% of total volume in a typical enterprise agent — the rest can run autonomously with monitoring.

What if the agent needs approval in the middle of a long task — will it lose context?

That is exactly why LangGraph checkpointing matters. The agent serialises its state at every interrupt point. If approval arrives an hour or even a day later, the agent resumes the flow from the exact point without losing context. Without checkpointing you would have to restart the entire task.

How does HITL relate to the EU AI Act?

Article 14 of the EU AI Act requires human oversight for high-risk AI systems. From 2 August 2026 this obligation applies to systems in areas such as employment, education, critical infrastructure, access to essential services, and others. If your agent falls into any of these categories, HITL is not a choice — it is a legal requirement that you must also be able to demonstrate in an audit.

*Designing and configuring a HITL flow for a specific agent is not a week-long project — it is an architectural decision that shapes the entire lifecycle of the system. At MP Industrial Solutions we help companies map exactly where the boundary between autonomy and oversight lies, set up an approval flow that matches their regulatory environment, and progressively build trust in agent performance. If you are considering a deployment or have an existing agent without deliberate HITL design — get in touch for a consultation.*


