AI Agent Observability: When You Don't Know Why the Agent Did What It Did

We got a call from a logistics client: an agent built to automate freight bookings had started repeatedly selecting the more expensive carrier instead of the cheaper one. The price difference was visible; the cause was not. The logs showed inputs and outputs. They did not show why the agent, at a specific step, called a tool with the parameters it chose. It took us four hours to reconstruct the decision sequence from raw logs. The fix itself took fifteen minutes, but only once we knew where to look.

This is a problem that recurs in most first production agent deployments. Teams know how to build an agent, they know how to run it, but they don't know how to debug it. Observability — the ability to understand a system's internal state from its external outputs — is not a luxury for agents. It is the prerequisite for being able to maintain and improve the system.

Eval vs. observability: two different things

Before we get to tools and approaches, it's important to distinguish two concepts that are frequently conflated.

Eval (evaluation) answers the question: *is the quality of the agent's output good enough?* It is measured offline or on historical data — for example: was the agent correct in 87% of cases? Did it generate the right format? Did it answer questions correctly?

Observability answers the question: *what exactly happened during a single run?* It is measured at runtime — while the agent is running or once a run has completed — and captures the sequence of decisions, tool calls, intermediate state, and errors.

Eval is retrospective statistics. Observability is a surgical instrument for a specific incident. Both are necessary — they do not duplicate each other.

Most teams start with eval and ignore observability. In practice, this works until the first production incident. After that, the priorities reverse.

What you must capture

An agent is not a single function. It is a network of steps — each step has an input, an output, an LLM prompt, a decision, and potentially an error. For every agent run you need to see:

- Trace: the complete call tree from the entry prompt to the final response, including all subtrees
- Spans: the individual steps within a trace (every LLM call, every tool call, every retrieval)
- State diff: what changed in the agent's internal state between steps (critical when using LangGraph)
- Tool calls: the exact parameters of every tool call and the exact output the tool returned
- LLM prompts: the actual prompt sent to the model (not the template, but the final text after interpolation)
- Tokens and latency: token count and time per span, for identifying bottlenecks
- Errors and retries: where the agent failed, how many times it retried, and with what result

Every one of these elements is essential. If you only capture the output of the final step, you have an application log — not agent observability.

Why classic logging is not enough

When developers first hear "observability," most respond: "but we already log." They do log — just not correctly for agents.

A classic log is sequential. Record after record, timestamp, message. For a simple API that is enough. For an agent where a single user input triggers dozens of LLM calls, tool calls, and retrieval steps — in a tree, not a sequence — a flat log is unusable. You have no context, no hierarchy, no state over time.

Another problem: agent frameworks like LangGraph make many decisions internally. Which node was selected, why a particular edge was activated, what the state was at a checkpoint: none of this is visible from a standard stdout log. You need integration directly with the framework, not a wrapper around it.

A third problem is sampling and cost. A production agent can execute thousands of runs per day. Logging *everything* into structured traces is expensive on storage. You need a sampling strategy — for example, capture 100% of errors and long-running runs, 10–20% of randomly sampled successful runs.

Trace: the foundation of agent observability

A trace is the root container for a single agent run. It corresponds to one task — for example, one user request or one scheduled operation. Inside the trace are spans — each span is one step with its metadata.

The hierarchy looks like this:

Trace: "Book freight for order #4821"
- Span: LLM call — understanding the request
- Span: Tool call — search_carriers(origin, destination, date)
- Span: Tool call — get_price(carrier_id=12)
- Span: Tool call — get_price(carrier_id=17)
- Span: LLM call — decision and reasoning
- Span: Tool call — book_shipment(carrier_id, price, ...)

Each span has: start and end time, input, output, any error, parent ID. From this we can reconstruct the full decision flow without guesswork.
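To make the span fields concrete, here is a minimal sketch of a tracer that records them and rebuilds the tree from parent IDs. It uses only the Python standard library and is single-threaded; the `Span`, `Tracer`, and `render` names, the tool inputs, and the price value are assumptions made for illustration, not the API of any of the platforms discussed here.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, List, Optional


@dataclass
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    start: float
    end: Optional[float] = None
    input: Any = None
    output: Any = None
    error: Optional[str] = None


class Tracer:
    """Collects the spans of one agent run (one trace), single-threaded."""

    def __init__(self) -> None:
        self.spans: List[Span] = []
        self._stack: List[str] = []  # ids of the currently open spans

    @contextmanager
    def span(self, name: str, input: Any = None) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        s = Span(name, uuid.uuid4().hex, parent, time.monotonic(), input=input)
        self.spans.append(s)
        self._stack.append(s.span_id)
        try:
            yield s
        except Exception as exc:
            s.error = repr(exc)  # record the failure, then let it propagate
            raise
        finally:
            s.end = time.monotonic()
            self._stack.pop()


def render(tracer: Tracer, parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Rebuild the call tree from parent ids, one indented line per span."""
    lines: List[str] = []
    for s in tracer.spans:
        if s.parent_id == parent_id:
            lines.append("  " * depth + s.name)
            lines.extend(render(tracer, s.span_id, depth + 1))
    return lines


tracer = Tracer()
with tracer.span("Book freight for order #4821") as root:
    with tracer.span("search_carriers", input={"origin": "A", "destination": "B"}) as s:
        s.output = [12, 17]  # hypothetical carrier ids
    with tracer.span("get_price", input={"carrier_id": 12}) as s:
        s.output = 410.0  # invented price

print("\n".join(render(tracer)))
```

Because every span stores its parent ID, the tree can be rebuilt after the fact even if spans are written to storage in completion order rather than call order.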

In LangGraph, this tree maps naturally onto graphs — every node in the graph is a span. Tools like LangSmith capture the state diff at every graph edge transition, providing a detailed view of how state evolves.

ReAct tracing: reason-act-observe in the logs

When an agent uses a ReAct architecture — the Reason, Act, Observe loop — observability should capture each of these steps explicitly.

Reason step: what the agent "thought" — the model's chain-of-thought output before calling a tool. This is the most valuable step for debugging. When an agent calls the wrong tool, the answer is almost always visible in the reason step — the model "decided" incorrectly for a reason that is right there in the text.

Act step: the exact tool call with parameters. Not the template — the actual values after interpolation. Errors in tool calling — malformed parameters, wrong type — are visible here.

Observe step: what the tool returned. Importantly: the raw tool output before the model processes it. When a tool returns an unexpected format or an error message, this is where you will see it.

Without explicitly logging these three steps per iteration you only have the final output. And the final output will not tell you why the agent changed its decision after the third iteration.
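A rough sketch of what logging the three steps per iteration can look like. The `decide` callable stands in for the LLM call, and `fake_decide`, the `get_price` stub, and all values are hypothetical; a real agent would parse the reason text and the tool call from the model response.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Decision = Tuple[str, Optional[str], Dict[str, Any]]  # (reason, tool name, tool args)


@dataclass
class ReActStep:
    iteration: int
    reason: str            # what the model "thought" before acting
    tool: Optional[str]    # act: tool name, or None when the agent stops
    tool_args: Dict[str, Any]  # act: the actual argument values
    observation: Any       # observe: the raw tool output, unprocessed


def run_react(decide: Callable[[Any], Decision],
              tools: Dict[str, Callable[..., Any]],
              max_iterations: int = 5) -> List[ReActStep]:
    steps: List[ReActStep] = []
    observation: Any = None
    for i in range(1, max_iterations + 1):
        reason, tool, args = decide(observation)
        if tool is None:
            # Keep the final reasoning too, even though no tool is called.
            steps.append(ReActStep(i, reason, None, {}, None))
            break
        try:
            observation = tools[tool](**args)
        except Exception as exc:
            observation = {"error": repr(exc)}  # tool failures are observations too
        steps.append(ReActStep(i, reason, tool, args, observation))
    return steps


def fake_decide(observation: Any) -> Decision:
    """Stand-in for the model: ask for one price, then stop."""
    if observation is None:
        return ("Need the price for carrier 12 first.", "get_price", {"carrier_id": 12})
    return ("Price is known, nothing more to call.", None, {})


tools = {"get_price": lambda carrier_id: {"carrier_id": carrier_id, "price": 410.0}}
steps = run_react(fake_decide, tools)
for step in steps:
    print(json.dumps(asdict(step)))  # one structured record per iteration
```

Each record can then be attached to the corresponding span, so that the reasoning, the call, and the raw result of one iteration stay together.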

Metrics to track

Beyond per-run traces you need aggregated metrics for monitoring:

- Latency p50/p95/p99: not just overall run latency, but per-node latency. Where is the agent spending its time?
- Token consumption per run: watch the trend and the outliers (a run that uses 10× more tokens than usual usually indicates an error pattern)
- Tool error rate: percentage of runs where at least one tool call failed
- Retry rate: percentage of runs that required a retry. A retry rate of 12% or higher typically signals a problem in schema validation or in the prompt
- Abandonment rate: percentage of runs where the agent did not reach its goal and returned a fallback response
- Human escalation rate: in HITL (human-in-the-loop) systems, the percentage of runs escalated to a human

Each of these metrics is symptomatic. A high retry rate points to unstable tool calling. A high p99 latency combined with low p50 latency points to outlier scenarios — likely long contexts or retry loops. A high abandonment rate points to edge cases the agent cannot handle.
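As an illustration, these aggregates can be computed from a list of per-run records. The sketch below assumes a hypothetical record shape (`latency_s`, `tool_errors`, `retries`, `abandoned`), uses invented sample data, and works at run level only; per-node latency would apply the same percentile function to span durations grouped by node.

```python
import math
from typing import Dict, List


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs: List[dict]) -> Dict[str, float]:
    n = len(runs)
    latencies = [r["latency_s"] for r in runs]
    return {
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
        "latency_p99": percentile(latencies, 99),
        "tool_error_rate": sum(r["tool_errors"] > 0 for r in runs) / n,
        "retry_rate": sum(r["retries"] > 0 for r in runs) / n,
        "abandonment_rate": sum(r["abandoned"] for r in runs) / n,
    }


# Ten invented runs: nine ordinary ones and one slow, failing outlier.
latencies = [1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 30.0]
runs = [{"latency_s": l, "tool_errors": 0, "retries": 0, "abandoned": False} for l in latencies]
runs[-1].update(tool_errors=1, retries=3, abandoned=True)
runs[-2].update(retries=1)

summary = summarize(runs)
print(summary)
```

In this sample the p50 is 2.0 seconds while p95 and p99 are both 30.0 seconds, which is the low-p50, high-p99 pattern described above: a single outlier run dominates the tail.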

Tools: LangSmith, Langfuse, Arize Phoenix

For production deployment, three main platforms exist today:

`LangSmith` — the deepest integration with LangGraph and LangChain. Automatically captures traces, state diffs, tokens, latency. Has a playground for replaying runs — you can take a specific trace and re-run it with a modified prompt. Cloud-hosted, with pricing based on trace volume. If your stack is LangGraph, LangSmith is the natural choice.

`Langfuse` is framework-agnostic and self-hostable (Postgres + ClickHouse). SDKs for Python and TypeScript, plus a REST API. Works even when you are not building on LangGraph: you call the SDK directly on every LLM call and tool call. Suitable for teams that want full control over their data (regulated industries, on-prem requirements). Self-hosting adds ops overhead, since a ClickHouse cluster needs maintenance.

`Arize Phoenix` — OpenTelemetry-based. Captures traces via the standard OTel protocol, which means integration with your existing monitoring stack (Prometheus, Grafana, Jaeger). Additionally: 50+ eval metrics directly in the platform — faithfulness, hallucination detection, relevance. For teams where agent observability should be part of a broader APM solution.

Choosing: if you are on LangGraph in the cloud — LangSmith. If you need self-hosting or have a heterogeneous stack — Langfuse. If you want observability and eval in one place and have OTel infrastructure — Arize Phoenix.

For smaller projects in an early stage: self-hosted Langfuse on a simple Docker Compose is sufficient and saves costs.

Replay and debugging a specific incident

Observability has the greatest value during a specific incident. The process that works in practice:

1. Identify the trace by trace ID or by filter (for example: all runs with at least one tool error in the last 24 hours)
2. Open the state diff and examine how the agent's state changed step by step. Where did an unexpected change occur?
3. Read the reason text in each LLM call, looking for the logic that answers "why did the agent make this decision"
4. Check the tool outputs: could the problem be in what the tool returned, not in what the agent did with the output?
5. Run a replay with a modified prompt or parameters to validate a hypothesis without needing the real environment

This process reduces MTTR (mean time to resolution) from hours to minutes. From experience: most agent bugs are not in LLM reasoning — they are in tool calling, in an unexpected output format from a tool, or in state management. Observability enables this distinction in seconds.
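The state diff used in step 2 can be approximated with a plain dictionary comparison when the agent state is a flat dict. This is a sketch under that assumption; the state keys and values below are invented, and platforms that capture state diffs do this for you.

```python
from typing import Any, Dict


def state_diff(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare two snapshots of agent state taken before and after one step."""
    return {
        "added": {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {
            k: (before[k], after[k])  # (old value, new value)
            for k in before.keys() & after.keys()
            if before[k] != after[k]
        },
    }


# Invented snapshots around a single step.
before = {"prices": {12: 410.0, 17: 455.0}, "selected_carrier": 12, "retries": 0}
after = {"prices": {12: 410.0, 17: 455.0}, "selected_carrier": 17, "retries": 0,
         "note": "carrier 12 rejected"}

diff = state_diff(before, after)
print(diff)
```

Unchanged keys such as `prices` drop out of the diff, so the step where `selected_carrier` flipped is immediately visible instead of being buried in the full state.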

Alerts and proactive monitoring

Reactive debugging is only one layer. A production system also needs proactive alerts.

Basic rules for alerting:

- Tool error rate > 5% over a one-hour window → alert; likely a change in the downstream system or in the tool schema
- Average run latency > 2× baseline → alert; outlier scenario or upstream slowdown
- Abandonment rate > 10% per day → alert; new edge cases appearing in production
- Token cost per day > X EUR → cost alert; essential for agents where a loop can generate calls without a limit. An agent without a cost alert can generate thousands of calls overnight and produce a bill in the thousands of euros, a risk we see in practice

Alerts should link to the trace, not just to a number. When an alert says "tool error rate 8%," it should include a direct link to the traces where the errors occurred, not a number without context.
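A minimal sketch of the alert rules with trace IDs attached. It assumes the caller has already selected the runs for the relevant window (one hour or one day, depending on the rule), and the record fields (`trace_id`, `latency_s`, `tool_errors`, `abandoned`, `cost_eur`) and sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    rule: str
    value: float
    threshold: float
    trace_ids: List[str]  # traces to open first when the alert fires


def check_alerts(runs: List[dict], baseline_latency_s: float, cost_limit_eur: float) -> List[Alert]:
    alerts: List[Alert] = []
    if not runs:
        return alerts
    n = len(runs)

    failed = [r["trace_id"] for r in runs if r["tool_errors"] > 0]
    if len(failed) / n > 0.05:
        alerts.append(Alert("tool_error_rate", len(failed) / n, 0.05, failed))

    avg_latency = sum(r["latency_s"] for r in runs) / n
    if avg_latency > 2 * baseline_latency_s:
        slowest = sorted(runs, key=lambda r: r["latency_s"], reverse=True)[:3]
        alerts.append(Alert("avg_latency_s", avg_latency, 2 * baseline_latency_s,
                            [r["trace_id"] for r in slowest]))

    abandoned = [r["trace_id"] for r in runs if r["abandoned"]]
    if len(abandoned) / n > 0.10:
        alerts.append(Alert("abandonment_rate", len(abandoned) / n, 0.10, abandoned))

    cost = sum(r["cost_eur"] for r in runs)
    if cost > cost_limit_eur:
        priciest = sorted(runs, key=lambda r: r["cost_eur"], reverse=True)[:3]
        alerts.append(Alert("token_cost_eur", cost, cost_limit_eur,
                            [r["trace_id"] for r in priciest]))
    return alerts


# Four invented runs; only t2 had a failing tool call.
runs = [
    {"trace_id": "t1", "latency_s": 2.0, "tool_errors": 0, "abandoned": False, "cost_eur": 0.1},
    {"trace_id": "t2", "latency_s": 3.0, "tool_errors": 1, "abandoned": False, "cost_eur": 0.1},
    {"trace_id": "t3", "latency_s": 2.0, "tool_errors": 0, "abandoned": False, "cost_eur": 0.1},
    {"trace_id": "t4", "latency_s": 2.0, "tool_errors": 0, "abandoned": False, "cost_eur": 0.1},
]
alerts = check_alerts(runs, baseline_latency_s=2.0, cost_limit_eur=50.0)
for a in alerts:
    print(a)
```

The point of the `trace_ids` field is that whoever receives the alert can jump straight to the offending runs; how those IDs become clickable links depends on the platform in use.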

Observability and agent costs

Observability itself generates costs — storage for traces, compute for analytics. For production systems with high volume it is important to configure a sampling strategy.

Recommended approach:

- 100% sampling for error runs (tool error, abandonment, timeout)
- 100% sampling for runs with unusual latency (p99 outliers)
- 10–20% sampling for successful runs within normal parameters
- 100% sampling during the rollout of a new agent version (first 24–48 hours)

When most runs succeed, this can reduce storage costs by roughly 5–10× while preserving full visibility for diagnostics. Platforms like Langfuse support sampling rate configuration natively.
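The sampling rules can be sketched as a single keep-or-drop decision made after a run has finished, since errors and latency are only known at that point. The function name, the record fields, and the 15% default are assumptions; the hash-based bucket is one way to make the decision for successful runs stable per trace ID instead of random.

```python
import hashlib


def should_keep_trace(run: dict, p99_latency_s: float, hours_since_rollout: float,
                      success_rate: float = 0.15) -> bool:
    """Decide, once a run has finished, whether its full trace is stored."""
    if run["tool_errors"] > 0 or run["abandoned"] or run["timed_out"]:
        return True  # always keep error runs
    if run["latency_s"] >= p99_latency_s:
        return True  # always keep latency outliers
    if hours_since_rollout < 48:
        return True  # keep everything while a new agent version rolls out
    # Successful, ordinary run: map the trace id to a stable bucket in [0, 1).
    digest = hashlib.sha256(run["trace_id"].encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < success_rate


healthy = {"trace_id": "t1", "tool_errors": 0, "abandoned": False,
           "timed_out": False, "latency_s": 2.0}
failed = dict(healthy, trace_id="t2", tool_errors=1)

print(should_keep_trace(failed, p99_latency_s=10.0, hours_since_rollout=100))  # True
print(should_keep_trace(healthy, p99_latency_s=10.0, hours_since_rollout=5))   # True
```

As a worked example with invented proportions: if 3% of runs are errors or outliers and 15% of the remaining 97% are sampled, the stored fraction is 0.03 + 0.97 × 0.15 = 0.1755, which is a reduction of about 5.7×.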

Connection to agent costs in production: observability lets you identify which runs are inefficient — for example, runs where the agent repeatedly calls the same tool because of an ambiguous response. This is typically a problem in tool calling reliability that observability exposes before it has a chance to significantly inflate costs.

The security dimension of tracing

Traces contain sensitive data — prompts, tool outputs, the agent's internal state. In regulated industries (finance, healthcare, legal), this requires:

- Encryption of traces at rest and in transit
- Access control: who can read trace records?
- Retention policy: how long do you keep traces?
- PII scrubbing: personal data in prompts or tool outputs must not be stored in plain text

Self-hosted Langfuse gives full control over all of these aspects. Cloud platforms must be evaluated against your GDPR/data residency requirements — especially if traces contain company or customer data.

For on-prem LLM deployments, self-hosted observability is almost always a requirement, not a choice.
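The PII scrubbing requirement listed above can be illustrated with a small pre-storage filter. This is only a sketch: the two regular expressions are deliberately crude assumptions (the phone pattern will also match some long numeric strings such as dates), and production systems in regulated industries typically need a dedicated PII detection step.

```python
import re
from typing import Any

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # rough: a digit run with separators


def scrub(value: Any) -> Any:
    """Recursively mask e-mail addresses and phone numbers before a trace is stored."""
    if isinstance(value, str):
        value = EMAIL.sub("[EMAIL]", value)
        return PHONE.sub("[PHONE]", value)
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value  # numbers, booleans, None pass through unchanged


# Invented span payload.
span_input = {
    "prompt": "Contact jan.novak@example.com or +420 601 123 456 about order #4821",
    "carrier_id": 12,
}
clean = scrub(span_input)
print(clean)
```

Scrubbing at write time, before the span leaves the process, means the raw values never reach trace storage, which is simpler to reason about than masking at read time.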

Frequently asked questions

Do I need observability before the first agent deployment?

Not necessarily before the first test deployment — but yes, before deploying to production with real users or real data. Without observability you have no way to identify where the agent is failing, or why. The first production incident that requires debugging without tracing typically convinces a team faster than any argument.

What is the difference between tracing and standard logging?

A classic log is a flat sequence of records. Tracing captures a hierarchical tree — a trace contains spans, spans contain child spans. For an agent where a single input triggers dozens of calls in a tree, the hierarchical structure is essential. Additionally, tracing captures the state diff between steps, which a flat log cannot provide.

Does observability work for agents not built on LangGraph?

Yes. Langfuse and Arize Phoenix are framework-agnostic — you simply call the SDK on every LLM call and tool call. If you are writing your own agent without a framework, you can add OpenTelemetry spans manually. LangSmith has the deepest integration with LangGraph, but integrations also exist for other frameworks.

How much does it cost?

Langfuse self-hosted is free (only infrastructure ops costs — a simple Docker Compose with Postgres + ClickHouse). LangSmith and Arize Phoenix have a free tier for low volumes; paid plans start in the range of tens of euros per month. The main cost is not the platform — it is storage for traces, especially at high run volumes.

Should I log full LLM prompts?

Yes — with one condition: if prompts contain personal or company data, you must handle PII scrubbing before saving. Logging only metadata (token count, latency, tool name) without prompt content is cheaper, but significantly limits debuggability. We recommend logging full prompts with a retention policy and access control in place.

Conclusion

*AI agent observability is not about tools — it is about a culture of accountability for what your system does in production. When an agent fails, the question "why" must have an answer within minutes, not hours. At MP Industrial Solutions we help teams set up observability as part of the agent architecture — not as an afterthought. If you are considering deploying an agent or are dealing with an incident in an existing system, we are happy to look at your specific case.*