When we deployed an agent for an energy-sector client to automate reports from a SCADA system, the first few days went smoothly. Then the silent failures began: the agent called the right tool but sent arguments in the wrong format — a number where a string was expected, null where a value was required. The result? The tool returned an error, the agent ignored it and carried on as if nothing had happened. The report was generated, it looked correct, but the numbers were wrong. It took us several days to track down the root cause.
Tool calling — invoking external tools and systems via an LLM — is the core capability of every production agent. It is also the single largest source of silent failures. This article focuses on concrete patterns: how to design tool schemas, validate arguments, handle errors, and build retry logic that holds up under real-world conditions.
Why tool calling fails — not in the way you think
When developers investigate agent unreliability, they usually look to the model: poor reasoning, insufficient deliberation, misunderstanding of the task. In practice, we find that most production incidents have a different root cause — bad tool schema design, missing argument validation, or unhandled tool errors.
The four most common failure patterns we see:
- Agent calls the wrong tool — because tool descriptions are too similar, vague, or non-discriminating
- Agent calls the right tool with wrong arguments — because the schema doesn't precisely describe which values are valid
- Agent ignores a tool error — because nobody handled the case where the tool returns an exception
- Agent enters a retry loop — because retry logic is unbounded and lacks exponential backoff
None of these problems relate to model intelligence. All of them can be fixed at the engineering level. And all of them recur regardless of the agent architecture used.
Tool schema design: precision over brevity
The tool schema is what the agent sees when deciding what to call. If the schema is imprecise, the agent guesses, and a guess is a problem when the systems behind the tool expect exact inputs.
Core schema design rules that hold up in practice:
The tool name must be an action verb plus an object. get_sensor_reading is good. sensor is bad. data is very bad. LLMs navigate by name — if two tool names sound similar, the agent will confuse them regardless of description quality.
The description must distinguish when to use the tool and when not to. Writing "returns sensor data" is not enough. Write: "Call only when you need live values from a specific device by ID. Do not use for historical data — use get_sensor_history for that." Explicit negation is just as important as a positive description.
Every parameter must have a type, description, and example. sensor_id: string is not enough. Write: "Sensor ID in the format PLANT_ZONE_NNN, e.g. A_LINE1_042. Never provide just a number without the prefix." The model will not infer the format — you must show it.
Mark optional parameters explicitly — and state the default value and what happens if they are omitted. Without this, the agent repeatedly generates null or assumes the wrong value.
Example schema before and after:
# Before (typical bad schema)
{
  "name": "sensor_data",
  "description": "Gets sensor data",
  "parameters": {
    "id": {"type": "string"},
    "from": {"type": "string"},
    "to": {"type": "string"}
  }
}

# After (production schema)
{
  "name": "get_sensor_reading",
  "description": "Reads current live value from a field sensor. Use ONLY for real-time values. For historical ranges use get_sensor_history instead. Returns {value, unit, timestamp, status}. Throws SensorOfflineError if device unreachable.",
  "parameters": {
    "sensor_id": {
      "type": "string",
      "description": "Sensor identifier in format PLANT_ZONE_NNN, e.g. A_LINE1_042"
    },
    "unit": {
      "type": "string",
      "enum": ["raw", "normalized"],
      "description": "Output unit type. Default: 'raw'",
      "default": "raw"
    }
  },
  "required": ["sensor_id"]
}

The difference isn't elegance; it's the rate at which the agent calls the right tool with valid arguments. In practice, fixing schemas reduced the rate of incorrect calls by more than half without any change to the model.
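The schema rules above can be checked mechanically before a tool ever reaches the agent. The following is a minimal sketch of such a lint step, not an established tool: the function name `lint_tool_schema`, the verb whitelist, and the keyword heuristic for negative scoping are all assumptions made for illustration, and it expects the simplified schema layout used in this article.

```python
import re

# Hypothetical verb whitelist; extend it to match your own naming convention.
ACTION_VERBS = {"get", "list", "create", "update", "delete", "send", "run"}


def lint_tool_schema(schema):
    """Return a list of human-readable problems found in one tool schema."""
    problems = []

    # Rule: the name is an action verb plus an object, e.g. get_sensor_reading.
    name = schema.get("name", "")
    verb, _, obj = name.partition("_")
    if verb not in ACTION_VERBS or not obj:
        problems.append(f"name '{name}' is not in verb_object form")

    # Rule: the description says when NOT to use the tool (crude keyword check).
    description = schema.get("description", "")
    if not re.search(r"\b(do not|don't|only|instead)\b", description, re.IGNORECASE):
        problems.append("description does not say when NOT to use the tool")

    # Rule: every parameter has a type and a description; optional ones a default.
    required = set(schema.get("required", []))
    for param, spec in schema.get("parameters", {}).items():
        if "type" not in spec:
            problems.append(f"parameter '{param}' has no type")
        if not spec.get("description"):
            problems.append(f"parameter '{param}' has no description")
        if param not in required and "default" not in spec:
            problems.append(f"optional parameter '{param}' has no default")
    return problems


bad = {"name": "sensor_data", "description": "Gets sensor data",
       "parameters": {"id": {"type": "string"}}}
for problem in lint_tool_schema(bad):
    print(problem)
```

Running a check like this in CI catches regressions when someone adds a tool in a hurry. The keyword heuristic will produce false positives and negatives, so treat its output as a review prompt rather than a gate.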
Argument validation: don't take the model at its word
Even with a good schema, the model will occasionally generate arguments that are syntactically valid but semantically wrong. from_date: "2026-13-45" is a valid string but an invalid date. count: -5 is a valid number but makes no sense.
Validating arguments before actually calling the tool is a requirement, not an optimisation.
Three validation layers we recommend:
1. JSON Schema validation — the simplest layer. Verify that the agent sent the correct type for each parameter. Most frameworks (LangGraph, LangChain) do this automatically when a schema is present, but an explicit layer is safer.
2. Domain validation — verifying that values make sense in your domain context. Dates within a reasonable range, IDs in the correct format, numeric values within a valid range. You must implement this yourself — no generic framework will do it for you.
3. Referential integrity — verifying that the referenced entity exists. sensor_id: "A_LINE1_042" must be checked against the actual sensor registry — the model has no way of knowing which IDs are valid.
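The three layers can be sketched as one function that runs them in order and stops at the first layer that fails. This is an illustration under stated assumptions: the `to_date` parameter, the in-memory `SENSOR_REGISTRY`, and the exact regular expression for `PLANT_ZONE_NNN` are made up for the example, and layer 1 is written by hand here where a JSON Schema validator would normally do the work.

```python
import re
from datetime import date

# Hypothetical registry; in production this is a lookup in the real sensor database.
SENSOR_REGISTRY = {"A_LINE1_042", "A_LINE1_043", "B_LINE2_007"}
# Assumed grammar for PLANT_ZONE_NNN: letters, then letters/digits, then 3 digits.
SENSOR_ID_PATTERN = re.compile(r"[A-Z]+_[A-Z0-9]+_\d{3}")


def _problem(parameter, message):
    return {"error": "INVALID_ARGUMENT", "parameter": parameter, "message": message}


def validate_history_args(args):
    """Run the three layers in order; return a list of structured problems."""
    # Layer 1: types (what a JSON Schema validator would normally check).
    problems = [
        _problem(name, "expected a string")
        for name in ("sensor_id", "from_date", "to_date")
        if not isinstance(args.get(name), str)
    ]
    if problems:
        return problems  # the later layers assume correct types

    # Layer 2: domain rules (ID format, real calendar dates, ordered range).
    if not SENSOR_ID_PATTERN.fullmatch(args["sensor_id"]):
        problems.append(_problem("sensor_id", "format must be PLANT_ZONE_NNN"))
    parsed = {}
    for name in ("from_date", "to_date"):
        try:
            parsed[name] = date.fromisoformat(args[name])
        except ValueError:
            problems.append(_problem(name, "not a valid YYYY-MM-DD date"))
    if len(parsed) == 2 and parsed["from_date"] > parsed["to_date"]:
        problems.append(_problem("from_date", "must not be after to_date"))
    if problems:
        return problems

    # Layer 3: referential integrity (does the sensor actually exist?).
    if args["sensor_id"] not in SENSOR_REGISTRY:
        problems.append(_problem("sensor_id", "unknown sensor, not in registry"))
    return problems
```

Stopping after a failed layer keeps the feedback short: there is little point telling the model a sensor is missing from the registry when the ID was not even a string.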
When validation fails, don't just throw an exception. Return a structured error message that tells the model *what* was wrong and *how* to fix it:
# Bad
raise ValueError("Invalid sensor_id")

# Good
return {
    "error": "INVALID_ARGUMENT",
    "parameter": "sensor_id",
    "received": "A_LINE_42",
    "message": "Format must be PLANT_ZONE_NNN. Did you mean A_LINE1_042?",
    "hint": "Call list_sensors() to get valid IDs for this plant"
}

This error message is the input to the agent's next step: the more informative it is, the greater the chance the agent corrects itself without unnecessary escalation.
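One way to produce the "Did you mean" part is fuzzy matching against the list of known IDs. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `KNOWN_SENSORS` list and the function name `invalid_sensor_error` are assumptions for illustration, and the 0.6 cutoff is simply the library default.

```python
import difflib

# Hypothetical list of valid IDs, standing in for the real sensor registry.
KNOWN_SENSORS = ["A_LINE1_042", "A_LINE2_007", "B_LINE1_001"]


def invalid_sensor_error(received):
    """Build a structured error that tells the model how to correct itself."""
    # Keep only the single best candidate above the similarity cutoff.
    matches = difflib.get_close_matches(received, KNOWN_SENSORS, n=1, cutoff=0.6)
    message = "Format must be PLANT_ZONE_NNN."
    if matches:
        message += f" Did you mean {matches[0]}?"
    return {
        "error": "INVALID_ARGUMENT",
        "parameter": "sensor_id",
        "received": received,
        "message": message,
        "hint": "Call list_sensors() to get valid IDs for this plant",
    }
```

A suggestion is only a hint. For tools with side effects it is safer to let the model confirm the ID through `list_sensors()` than to act on the nearest match automatically.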
Error handling: tool failures are normal, not exceptional
One of the most common conceptual problems we see in production agents: the code was written for the happy path. When a tool returns an error, the agent behaves unpredictably — it either stalls or continues with a corrupted state.
Production reality: tools fail. APIs go down. Databases are temporarily unavailable. SCADA systems return timeouts. These are not edge cases — they are routine events in industrial environments.
Basic error handling pattern for every tool:
import time


# Placeholder exception types; map your client library's errors onto these.
class TransientError(Exception):
    """Retryable failure, e.g. a timeout."""


class PermanentError(Exception):
    """Non-retryable failure, e.g. an authorisation error."""


def call_tool_safely(tool_fn, args, max_retries=3):
    for attempt in range(max_retries):
        try:
            result = tool_fn(**args)
            return {"status": "ok", "data": result}
        except TransientError as e:
            # Last attempt used up: report the failure instead of sleeping again.
            if attempt == max_retries - 1:
                return {
                    "status": "error",
                    "type": "TRANSIENT",
                    "message": str(e),
                    "hint": "Service temporarily unavailable, retry later"
                }
            time.sleep(2 ** attempt)  # exponential backoff: 1 s, 2 s, 4 s, ...
        except PermanentError as e:
            return {
                "status": "error",
                "type": "PERMANENT",
                "message": str(e),
                "hint": "Do not retry. Escalate to human operator."
            }
        except Exception as e:
            return {
                "status": "error",
                "type": "UNKNOWN",
                "message": str(e),
                "hint": "Unexpected error. Log and escalate."
            }

Key principles:
- Distinguish transient from permanent errors. A timeout is a transient error (retry). An authorisation error is permanent (don't retry, escalate). The model must know what to do next.
- Exponential backoff — not a fixed pause. Each retry waits twice as long (1 s, 2 s, 4 s…). Prevents overloading systems during outages.
- Maximum retry count — always bounded. Without a limit, the agent can enter an infinite loop. In practice, a retry rate of around 12 % is common in production systems — if it is not under control, you can double your total call volume and costs.
- Structured error message — the agent must receive information about the error type and the recommended next step.
You can read more about the cost implications of retry logic in the article on AI agent costs in production.
Idempotency: the forgotten reliability requirement
When an agent retries a tool call, it must be safe to call it multiple times with the same arguments. If it isn't — you have a problem.
Idempotent call: calling get_sensor_reading(sensor_id="A_LINE1_042") multiple times returns the same (or equivalent) value with no side effect. Safe to retry.
Non-idempotent call: calling create_work_order(equipment_id="P-042", type="inspection") multiple times creates multiple work orders. A disaster on retry.
Rule for production systems: every tool that performs a write, mutation, or side-effecting action must be protected against duplicate calls.
Protection patterns:
Idempotency key — the agent generates a unique key for each task and sends it with the call. The backend registers the key, and a second call with the same key returns the result of the first call without executing the action again:
# `db` and `actually_create_work_order` stand for your own storage layer and backend call.
def create_work_order(equipment_id, type, idempotency_key):
    existing = db.find_by_key(idempotency_key)
    if existing:
        return existing  # Return the previous result
    result = actually_create_work_order(equipment_id, type)
    db.store(idempotency_key, result)
    return result

State check before action: before a write operation, verify whether the state already matches the target. If a work order for equipment P-042 for today already exists, don't create a new one. This is simpler than an idempotency key and suitable for business domains with natural uniqueness constraints.
Explicit documentation in the schema — every tool with a side effect must state in its description: "This action is irreversible / creates a record. Call only when you are certain the conditions are met." A model that receives this information behaves more cautiously.
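Two of the patterns above can be sketched in a few lines. The first function shows one possible way to derive an idempotency key so that a retry of the same task produces the same key; the second shows a state check before action. Everything here is an assumption for illustration: the names `make_idempotency_key` and `ensure_work_order`, the choice of SHA-256 over canonical JSON, and the in-memory dictionary that stands in for a real database with a uniqueness constraint.

```python
import hashlib
import json
from datetime import date


def make_idempotency_key(task_id, tool_name, args):
    """Derive a stable key: the same task, tool and arguments always match."""
    # sort_keys makes the JSON canonical, so argument order does not matter.
    payload = json.dumps([task_id, tool_name, args], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical in-memory store keyed by (equipment_id, order_type, day).
_work_orders = {}


def ensure_work_order(equipment_id, order_type, day):
    """State check before action: create the order only if none exists yet."""
    key = (equipment_id, order_type, day)
    if key in _work_orders:
        return {"created": False, "order": _work_orders[key]}
    order = {
        "id": len(_work_orders) + 1,
        "equipment_id": equipment_id,
        "type": order_type,
        "day": day.isoformat(),
    }
    _work_orders[key] = order
    return {"created": True, "order": order}
```

Deriving the key from the task rather than generating a random one per call matters: a random key regenerated on retry would defeat the purpose. In a real backend the check and the insert must be atomic, for example via a unique index, because two concurrent retries could otherwise both pass the check.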
Why agents call the wrong tool — and how to reduce it
One of the more puzzling problems: an agent consistently calls tool_A instead of tool_B even though tool_B is correct. There are several causes, each with a different solution.
Names and descriptions that are too similar. If you have get_equipment_data and fetch_equipment_info, the agent will mix these up. Solution: go through your entire tool set and verify that names and descriptions are unambiguously distinct. If two colleagues cannot tell the difference without reading the code, the model can't either.
Too many tools at once. LLMs have limited attention capacity. When you present an agent with 30 tools in a single prompt, the probability of incorrect selection increases. The solution pattern is toolset routing: the agent first selects a group of tools (e.g. "sensor tools" vs "reporting tools"), then picks a specific one from that group. This maps directly onto the Model Context Protocol (MCP), where servers act as natural tool groups.
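Toolset routing can be as simple as a two-stage lookup. In the sketch below the model first sees only group names and descriptions, then only the tools of the group it picked. The group names, the reporting tool names (`create_report`, `send_report`), and both function names are hypothetical.

```python
# Hypothetical tool groups; with MCP, each group could map to one server.
TOOL_GROUPS = {
    "sensor": {
        "description": "Live and historical sensor values",
        "tools": ["get_sensor_reading", "get_sensor_history", "list_sensors"],
    },
    "reporting": {
        "description": "Building and sending reports",
        "tools": ["create_report", "send_report"],
    },
}


def group_menu():
    """Stage 1: the model sees only the group names and descriptions."""
    return [
        {"group": name, "description": group["description"]}
        for name, group in TOOL_GROUPS.items()
    ]


def select_tools(group):
    """Stage 2: expose only the tools of the group the model picked."""
    if group not in TOOL_GROUPS:
        # Structured error so the model can correct its choice.
        return {
            "status": "error",
            "type": "UNKNOWN_GROUP",
            "hint": f"Valid groups: {sorted(TOOL_GROUPS)}",
        }
    return {"status": "ok", "tools": TOOL_GROUPS[group]["tools"]}
```

The cost is one extra model turn per task; the benefit is that each selection step chooses among a handful of options instead of dozens.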
Missing context in the prompt. If the system prompt doesn't tell the agent what context it is operating in, the model has to guess the right tool from conversation history alone. Explicit context in the system prompt ("You are working in the manufacturing management system, not the HR system") reduces tool selection errors.
Model training bias. Some models have training data that favours certain call patterns. If the agent repeatedly calls the wrong tool despite a correct schema, try changing the tool name — sometimes that alone helps.
Retry logic: five rules that will save your production system
Retry is an integral part of reliable tool calling. But poorly implemented retry is worse than no retry at all — it can turn a transient outage into a cascading failure.
Rule 1: Only retry transient errors. HTTP 503 (Service Unavailable) and timeouts are transient. HTTP 400 (Bad Request) and 403 (Forbidden) are permanent — retrying won't help, it only wastes resources.
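Rule 1 is easiest to enforce when the classification lives in one place. A minimal sketch follows: 503, 400 and 403 come from the rule above, while the other status codes in the two sets are our assumptions and should be adjusted to how your backends actually behave. The function names are hypothetical.

```python
from typing import Optional

# 503, 400 and 403 are named in Rule 1; the remaining codes are assumptions.
TRANSIENT_STATUSES = {408, 429, 502, 503, 504}
PERMANENT_STATUSES = {400, 401, 403, 404, 422}


def classify_failure(status_code: Optional[int], timed_out: bool = False) -> str:
    """Return 'TRANSIENT', 'PERMANENT' or 'UNKNOWN' for a failed call."""
    if timed_out or status_code in TRANSIENT_STATUSES:
        return "TRANSIENT"
    if status_code in PERMANENT_STATUSES:
        return "PERMANENT"
    return "UNKNOWN"  # not retried: unknown failures should be looked at


def should_retry(status_code: Optional[int], timed_out: bool = False) -> bool:
    """Only transient failures are worth another attempt."""
    return classify_failure(status_code, timed_out) == "TRANSIENT"
```

Treating unknown failures as non-retryable is the conservative default: a code you have never seen is more likely a bug than an outage.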
Rule 2: Exponential backoff with jitter. A fixed pause (retry every second) overloads the system during an outage. Exponential backoff with random jitter (±20 %) spreads the load:

delay = min(base_delay * (2 ** attempt) * random.uniform(0.8, 1.2), max_delay)

Rule 3: Hard limit on attempts. Three attempts are usually enough. Five is the maximum. More retries → greater risk of a retry storm (all agents simultaneously retrying a failed system).
Rule 4: Circuit breaker for recurring outages. If a given tool fails multiple times in succession, temporarily "disconnect" it and immediately return CIRCUIT_OPEN for all subsequent calls. After a cooldown period, allow one test call through. This also protects your backend systems.
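A circuit breaker for Rule 4 fits in one small class. The sketch below is a single-threaded illustration, not a production implementation: the class name, the default threshold and cooldown, and the `TOOL_FAILED` error type are assumptions. The clock is injectable so the behaviour can be tested without sleeping.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures, probes after `cooldown` s."""

    def __init__(self, threshold=3, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable so tests need not sleep
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, tool_fn, **args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                # Still cooling down: fail fast without touching the backend.
                return {"status": "error", "type": "CIRCUIT_OPEN"}
            # Cooldown elapsed: fall through and let one test call run.
        try:
            result = tool_fn(**args)
        except Exception as exc:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.threshold:
                # Open the circuit, or re-open it after a failed test call.
                self.opened_at = self.clock()
            return {"status": "error", "type": "TOOL_FAILED", "message": str(exc)}
        # Success closes the circuit and resets the failure counter.
        self.failures = 0
        self.opened_at = None
        return {"status": "ok", "data": result}
```

Use one breaker instance per tool, so that an outage in one backend does not block calls to healthy ones. If several agent workers share a backend, the breaker state needs to be shared as well, and the class would need locking.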
Rule 5: Fallback action. When both retry and circuit breaker fail, what does the agent do? Escalate to a human operator, continue without that piece of information, or stop? Every critical tool must have a defined fallback strategy. Without one, the agent is unpredictable under failure — which is a direct motivation for HITL gates.
Testing tool calling before production
Tool calling reliability cannot be verified by manual happy-path testing alone. A few testing patterns we recommend:
Mock toolset testing — create mock versions of all tools that return different combinations of responses: success, transient error, permanent error, malformed output. Run the agent against these mocks and observe behaviour.
Argument boundary testing — manually test what arguments the model generates for edge-case inputs: empty strings, null values, extreme numbers, dates far in the past or future. The schema must explicitly exclude these cases or the agent will fail on them repeatedly.
Chaos testing — with a certain probability (e.g. 20 %), inject an error into a tool during testing. Verifies that retry logic and fallback behaviour work in practice, not just in unit tests.
Trace review — for production deployments, observability and tracing are essential. Every tool call must be logged with its arguments, result, latency, and any error. Without this, you won't even know where to look when a production incident occurs.
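Mock toolset testing and chaos testing need very little infrastructure. The sketch below shows a scripted mock tool that replays a fixed sequence of results and errors, plus a wrapper that injects failures with a given probability. All names (`ScriptedTool`, `with_chaos`, the local `TransientError`) are hypothetical helpers for illustration.

```python
import random


class TransientError(Exception):
    """Stand-in for a retryable failure such as a timeout."""


class ScriptedTool:
    """Mock tool that replays a fixed script of results and exceptions."""

    def __init__(self, script):
        self.script = list(script)
        self.calls = []             # recorded arguments, for assertions

    def __call__(self, **args):
        self.calls.append(args)
        # Raises IndexError if the agent calls more often than scripted.
        outcome = self.script.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def with_chaos(tool_fn, failure_rate=0.2, rng=None):
    """Wrap a tool so that roughly `failure_rate` of calls raise TransientError."""
    rng = rng or random.Random()

    def wrapped(**args):
        if rng.random() < failure_rate:
            raise TransientError("injected failure")
        return tool_fn(**args)

    return wrapped
```

A script such as `ScriptedTool([TransientError("timeout"), {"value": 21.5}])` checks that the agent recovers after one failed attempt, and `tool.calls` shows exactly which arguments it sent each time. Passing a seeded `random.Random(...)` as `rng` makes a chaos run reproducible.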
Frequently asked questions
How many tools can an agent have at once?
In practice, above 15–20 tools in a single context, selection reliability degrades significantly. The solution is toolset routing or MCP servers — the agent first selects the right group of tools, then picks a specific tool from that group. If you have 50+ tools, hierarchical navigation is a necessity, not an optimisation.
What should I do if the agent generates hallucinated arguments — values that don't exist?
Two steps: first, the schema must explicitly show the format of valid values, including examples. Second, validation must verify the existence of referenced entities (e.g. sensor_id against the registry) and return a structured error with a hint on how to find valid values. Additionally, a list_available_X() tool (e.g. list_sensors()) helps the agent look up valid IDs itself without hallucinating.
Is it safe to let an agent execute irreversible actions automatically?
It depends on context. For low-risk actions (reading data, generating reports), full automation is fine. For irreversible actions (creating orders, modifying configurations, sending messages), we recommend an explicit HITL gate: the agent proposes the action, a human operator approves it. The EU AI Act's human oversight requirements for high-risk AI systems are scheduled to apply from August 2026, which points directly to HITL for actions with potentially serious consequences.
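A HITL gate can be a thin wrapper around tool execution. In this sketch, `approve` is any callable that asks a human and returns True or False; the set of gated tool names, the function name, and the `REJECTED_BY_OPERATOR` error type are assumptions for illustration.

```python
# Hypothetical set of tools that need human approval before they run.
IRREVERSIBLE_TOOLS = {"create_work_order", "update_configuration", "send_message"}


def execute_with_gate(tool_name, tool_fn, args, approve):
    """Run read-only tools directly; ask `approve` before irreversible ones."""
    if tool_name in IRREVERSIBLE_TOOLS and not approve(tool_name, args):
        # The tool is never called when the operator says no.
        return {
            "status": "error",
            "type": "REJECTED_BY_OPERATOR",
            "hint": "Do not retry. Report that the action was not approved.",
        }
    return {"status": "ok", "data": tool_fn(**args)}
```

Keeping the list of gated tools in code rather than in the prompt means the gate still holds when the model misjudges how risky an action is.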
What is the fastest way to reduce the rate of incorrect calls?
In practice, fixing tool schemas has the biggest impact — specifically, discriminating descriptions, examples of valid values, and explicit negative scoping ("do not use this for…"). The second biggest impact comes from reducing the number of tools in a single context. These two changes require no model or architecture changes.
How do I monitor tool calling quality in production?
Log every call with: timestamp, tool_name, arguments (without PII), status (ok/error), error type, latency, attempt number. Track the error rate per tool — if one tool has an error rate 3× higher than average, that is a signal to audit the schema. Tools such as Langfuse (self-hostable) or LangSmith enable this without building your own observability infrastructure.
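The "3× higher than average" check is a short aggregation over the call log. The sketch below assumes each log record is a dictionary with at least `tool_name` and `status`, and it takes "average" to mean the mean of the per-tool error rates; both are assumptions, and the function name is hypothetical.

```python
from collections import defaultdict


def flag_noisy_tools(call_log, factor=3.0):
    """Return tools whose error rate exceeds `factor` times the average rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in call_log:
        totals[record["tool_name"]] += 1
        if record["status"] == "error":
            errors[record["tool_name"]] += 1
    if not totals:
        return []
    # Error rate per tool, then the mean of those rates as the baseline.
    rates = {tool: errors[tool] / totals[tool] for tool in totals}
    average = sum(rates.values()) / len(rates)
    if average == 0:
        return []
    return sorted(tool for tool, rate in rates.items() if rate > factor * average)
```

With this definition a tool can only be flagged when there are more than three tools in the log, because each tool's own rate is part of the average. For small toolsets a fixed threshold per tool is the more useful alarm.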
Conclusion
*Tool calling is where most agent reliability problems begin — and also where they can be addressed purely at the engineering level, without changing the model. Well-designed schemas, argument validation, structured error handling, and bounded retry logic transform an agent from an unpredictable system into an operable tool. If you are about to deploy an agent to production or are currently troubleshooting reliability issues with an existing system, we are happy to assess your specific case — we will propose tool design, validation, and error handling for your environment.*
