Observability for AI Agents

Introduction

AI agents are increasingly being deployed in production systems — answering customer questions, writing and executing code, browsing the web, calling APIs, and coordinating with other agents to complete complex tasks. They are no longer toys or demos. They are software, and like all software, they break.

But when an AI agent breaks, you often don’t know it. There’s no stack trace pointing to line 42. The agent might return a confident, well-formatted answer that happens to be completely wrong. Or it silently calls the wrong tool seventeen times before timing out. Or it slowly drains your API budget over a weekend while nobody is watching.

This is the observability problem for AI agents — and it’s a harder version of a problem the software industry has spent years solving.

This article starts with the fundamentals of observability as it applies to traditional software systems, then examines why those fundamentals break down for AI agents, and finally looks at what observability actually means in an agentic context: what to measure, how to measure it, and where to start.

1. What is Observability

If we break the word down, observability is made up of two words: observe and ability. Which gives us a simple working definition:

Observability is the ability to understand and monitor the internal state or condition of a system — the ability to observe a system.

In complex software systems, observability is crucial for understanding how a system is behaving and for troubleshooting issues when things go wrong. It provides the insights needed to understand not just what is happening inside the system, but why it is doing it — which is essential for finding the root cause of any failure or anomaly.

Without observability, optimization and troubleshooting efforts are largely ineffective. You are acting on guesses and assumptions about what is happening inside the system rather than evidence.

The Three Pillars of Observability

Observability in software is traditionally built on three foundational data types:

Logs are timestamped, event-level records of what happened inside a system. They are the most granular form of observability data — a log entry might record that a database query ran, that a user authenticated, or that a function threw an exception. Logs are excellent for forensic investigation: when something went wrong, logs let you reconstruct exactly what happened and in what order.

Metrics are numeric measurements collected over time. Where logs tell you about individual events, metrics tell you about the state of a system at a high level — things like request latency, error rate, memory usage, and throughput. Metrics are efficient to store and ideal for alerting and dashboards. They tell you that something is wrong quickly, even if they don’t always tell you why.

Traces track the journey of a single request as it moves through a distributed system. A trace is made up of spans — individual units of work, like a database call or an HTTP request to an external service — linked together to show the full path from start to finish. Traces are the key to understanding latency and failure in distributed systems, where a single user request might touch dozens of services.

Observability vs. Monitoring

These two terms are often used interchangeably, but they describe different things. Monitoring is about watching known signals — you define a threshold (e.g., “alert me if error rate exceeds 1%”) and wait for it to be crossed. It is reactive and works well when you know what failure looks like in advance.

Observability is broader. It is the property of a system that allows you to ask arbitrary questions about its internal state — including questions you didn’t think to ask when you built the system. Good observability means you can investigate a novel failure mode without needing to instrument new code first.

In practice, monitoring uses the data that observability produces. You need both.

2. Why Traditional Observability Breaks for Agents

Traditional software follows deterministic, predictable code paths. A web request comes in, hits a route handler, queries a database, formats a response, and returns. Given the same input, the system takes the same path. Tracing that path is straightforward — you instrument known functions and services, and the trace reflects what you expect.

AI agents work differently.

An agent doesn’t follow a fixed path. It reasons. Given a user request, it decides what to do next — which tool to call, what to say, whether to delegate to another agent. The path through the system is not known in advance. It emerges from the model’s output at runtime. This breaks several assumptions that traditional observability tooling is built on.

Non-determinism. The same prompt can produce different reasoning chains on different runs. An agent that worked perfectly yesterday might take a completely different sequence of steps today, leading to a different outcome. Traditional traces assume the path is reproducible — for agents, it often isn’t.

Black-box LLM calls. The core computation — the LLM call itself — is entirely opaque. You send in a prompt, you get back a response. There are no function names, no call stacks, no internal state to inspect. The model’s reasoning process is not accessible. You can only observe its inputs and outputs.

Dynamic tool use. Agents decide which tools to call at runtime, based on what the model reasons is appropriate. A traditional trace has a fixed shape; an agent trace has a shape that depends on the model’s decisions. Instrumenting a fixed set of functions doesn’t capture what an agent does if those functions are called in unexpected combinations or not at all.

Semantic failures. When a traditional system fails, it usually fails loudly — an exception is thrown, a non-200 status code is returned, a process crashes. Agents can fail silently. The agent completes successfully from a technical standpoint but returns an answer that is factually wrong, incomplete, or subtly misaligned with what the user asked. No error is thrown. No alert fires. The failure only surfaces when a human notices the bad output — if they notice at all.

Multi-agent complexity. As systems grow to include multiple agents coordinating with each other, the problem compounds. Which agent in a pipeline caused the bad output? What context was passed between them? What did each agent see when it made its decision? These are hard questions without end-to-end tracing designed for agentic systems.

The Core Challenge

All of these problems share a common root: AI agents produce outcomes through a reasoning process you cannot inspect.

In traditional software, the internal state of a system is always derivable — it is the code, executing deterministically on known inputs. If something breaks, you can read the code, step through the call stack, and find the line responsible. The system’s behavior is fully explained by its structure.

An LLM has no inspectable internal state. It takes a prompt and produces a response, and the chain of reasoning that connects the two is inaccessible. When an agent fails — even catastrophically — there is no call stack to read. There is only what went in and what came out. And if the failure is semantic (a wrong answer rather than an error), even that signal is absent.

This means the standard debugging loop — something broke, read the code, find the cause — does not work for agents. You cannot read the model. You can only observe its behavior: the inputs it received, the decisions it made, the tools it invoked, the outputs it produced. The core challenge of agent observability is building a complete record of that behavior at every level of granularity — so that when something goes wrong, the evidence already exists.

3. What Observability for AI Agents Actually Means

The answer to the core challenge is instrumentation that treats the agent’s behavior as the source of truth, not its code. Since you cannot inspect the model’s reasoning directly, you reconstruct it from the outside: every input it saw, every decision it made visible through an output, every tool it invoked, every response it generated. Assembled in sequence, these observations become a full trace of the agent’s reasoning chain — the closest approximation to a call stack that is possible for a non-deterministic system.

This still uses logs, metrics, and traces — but extends them to capture what matters in an agentic context: reasoning context, decisions, tool interactions, and output quality.

What to Track at Each Layer

LLM calls are the atomic unit of an agent’s work. At minimum, every LLM call should be logged with: the full prompt sent to the model, the full response received, the model and version used, the token counts (input, output, cached), and the latency of the call. Token counts are especially important — they are the primary driver of cost, and runaway token usage is one of the most common failure modes in production agents.

Tool calls are where agents interact with the outside world — running code, querying databases, calling APIs, reading files. Each tool call should be traced with: which tool was invoked, what inputs were passed to it, what output it returned, and whether it succeeded or failed. Tool call logs are often where the most diagnostically useful information lives, because they show what the agent actually did, not just what it said.

Agent runs are the top-level unit of work — the full sequence of steps an agent takes from receiving a user input to producing a final output. A run trace ties together all the LLM calls and tool calls in a single agent session, making it possible to understand the full reasoning chain. Without run-level tracing, you have a collection of disconnected events rather than a coherent story.

Sessions capture multi-turn conversations over time, showing how context accumulates and how the agent’s behavior changes as the conversation evolves. Session-level observability is important for conversational agents where the history of the conversation is part of what the model sees.

Evaluation as a Dimension of Observability

Traditional observability answers: did the system run? Agent observability also has to answer: did it run correctly?

This is the evaluation dimension — and it’s unique to AI systems. Evaluation asks whether the agent’s outputs are accurate, relevant, helpful, and aligned with the intended behavior. It can take several forms: human review of a sample of outputs, automated checks using an LLM as a judge, comparison against golden datasets of known-good answers, or user feedback signals like thumbs up/down ratings.

Evaluation is not a one-time offline exercise. In a production system, it is an ongoing observability signal — as important as error rate or latency.

4. Why It Matters — Failure Scenarios

Abstract arguments for observability are easy to dismiss. Concrete failure scenarios are harder to ignore.

The infinite loop. An agent is tasked with researching a topic. It calls a search tool, gets a result, decides the result is insufficient, calls the search tool again with a slightly different query, decides that result is also insufficient, and repeats — indefinitely. Without token usage monitoring and a max-iteration safeguard, this runs until a timeout kills it or your API budget is exhausted.

Tool misuse. An agent is given access to a database query tool. It constructs a query with a subtle parameter error — perhaps passing a string where an integer is expected. The tool returns an error. The agent, reasoning that it should retry, constructs another query with the same error, in a slightly different form. This repeats across dozens of calls. Without tool call logging, you see only that the agent failed; you don’t know why.

Mid-chain hallucination cascade. An agent reasoning through a multi-step task hallucinates a fact in step 2. Every subsequent step is built on that incorrect assumption. The final output is wrong, but confidently presented. The hallucination is invisible without tracing the full reasoning chain — you can’t identify step 2 as the failure point without seeing step 2.

Silent budget drain. An agent deployed on a weekend begins receiving unusual inputs that cause it to generate unusually long responses. Token usage climbs steadily through Saturday and Sunday. No alert fires because nobody set one. On Monday morning, the cost report shows a weekend spend ten times higher than expected.

The confident wrong answer. An agent returns a well-structured, professional response to a user query. The response is factually incorrect. No error was thrown. No metric was breached. The failure is only discovered when a downstream human or system acts on the wrong information. Without output evaluation, there is no signal that anything went wrong.

5. Core Strategies

Knowing what to observe is one thing. Building the practice of observing it is another. These are the strategies that matter most.

Instrument at the LLM call level. Every call to an LLM should be logged — prompt in, response out, model, token counts, latency. This is the minimum viable observability baseline. Without it, you are flying blind.

Trace entire agent runs end-to-end. Individual LLM call logs are not enough. You need run-level tracing that ties all the calls and tool uses in a single agent session together with a shared trace ID. This is what makes it possible to debug a failure by looking at the sequence of events that led to it.

Use structured logging. Plain text logs are hard to query and filter at scale. Log in structured formats (JSON) so that downstream tools can filter, aggregate, and alert on specific fields — model name, token count, tool name, error type, agent ID.

Capture the full context window. What the model actually saw before generating a response — the full prompt including system message, conversation history, and retrieved context — is often the most diagnostically valuable piece of information. Log it.

Track token usage per step and in aggregate. Per-step token counts help identify which steps are expensive. Aggregate counts per session or run help identify runaway behavior. Both are important.

Set up cost and latency alerting before you need it. Don’t wait for a budget surprise to add cost monitoring. Set spending alerts at sensible thresholds from day one. Do the same for latency — if a run that normally takes 10 seconds starts taking 60, you want to know.

Build in output evaluation. Start simple: a human reviewing a random 5% sample of agent outputs is more valuable than no evaluation. Automate where you can — LLM-as-judge for common quality dimensions, rule-based checks for factual constraints, user feedback collection in the UI.

6. Tooling Landscape

The tooling ecosystem for LLM and agent observability has grown rapidly. Here is an overview of the main categories and tools.

Purpose-Built LLM Observability

LangSmith is the observability platform built by the LangChain team. It integrates natively with LangChain and LangGraph and provides tracing, evaluation, prompt management, and a dataset testing framework. It is the natural choice if you are building on the LangChain ecosystem.

Langfuse is an open-source LLM observability platform that is framework-agnostic. It works with any LLM or agent framework and provides tracing, cost tracking, evaluation, and a self-hostable option for teams with data privacy requirements. A strong default choice for teams not tied to a specific framework.

Arize Phoenix is open-source and focused on evaluation and debugging. It provides deep tooling for analyzing model outputs, running experiments, and identifying failure patterns. Well-suited for teams that take evaluation seriously as an ongoing practice.

Weights & Biases Weave extends W&B’s ML experiment tracking into LLM territory. It provides tracing, versioning of prompts and models, and evaluation tooling. A natural fit for teams already using W&B for model training.

Helicone is a lightweight, proxy-based solution that sits between your application and the LLM API. It requires minimal code changes and captures logs automatically. Good for quick setup, though it offers less depth than dedicated tracing tools.

General APM with LLM Support

Platforms like Datadog and New Relic have added LLM observability features to their existing APM offerings. These are useful if your team is already using one of these platforms and wants to see LLM metrics alongside the rest of your infrastructure data — though they typically offer less LLM-specific depth than purpose-built tools.

OpenTelemetry for LLMs

OpenTelemetry (OTEL) is the open standard for distributed tracing and metrics. The community is actively developing semantic conventions for LLM calls — standardized attribute names for things like model name, token counts, and prompt content. OpenLLMetry is a library that implements these conventions and can emit OTEL-compatible traces from LLM calls to any OTEL-compatible backend. This is the right approach for teams who want vendor-neutral instrumentation.

7. Challenges to Keep in Mind

Observability for AI agents is not without its own complications.

Storage cost. Prompts and responses are large. Logging every LLM call in full, for a high-volume production system, generates significant storage. You will need to make deliberate choices about what to log in full vs. what to summarize or sample.

Privacy and PII. User messages passed to an agent often contain sensitive information. Responses may repeat or synthesize it. Logging everything means storing sensitive data — which creates compliance obligations. You need a clear data handling policy for observability data, including redaction, retention limits, and access controls.

High cardinality. Every agent run generates unique trace IDs, tool inputs, and prompt contents. Metrics systems that struggle with high cardinality — many unique label combinations — can become expensive or slow. Choose your metrics dimensions carefully.

Instrumentation overhead. Adding observability adds latency. Logging synchronously can slow down your agent’s response time. Use asynchronous logging where possible, and measure the overhead of your instrumentation.

Evaluating probabilistic outputs. There is no single ground truth for most agent outputs. Evaluation requires judgment — which is expensive at scale. Automated evaluation using LLM-as-judge is useful but introduces its own error rate. Calibrate your evaluation signals against human judgment regularly.

A fast-moving ecosystem. The tooling, standards, and best practices in this space are evolving quickly. What is standard today may be outdated in six months. Build on open standards (OpenTelemetry) where you can, and avoid deep lock-in to any single vendor’s proprietary data format.

8. Where to Start — A Practical Closer

Observability can feel overwhelming to add to an existing system, especially one already in production. The instinct is often to wait until things break badly enough to justify the investment. Don’t. The time to add observability is before you need it — because when you need it, you need it immediately.

The approach below is intentionally incremental. Each step delivers standalone value, so you can stop at any point and still be better off than before.

Step 1: Log every LLM call.

This is the non-negotiable baseline. Every call to an LLM — whether it’s a simple completion or part of a complex agent loop — should produce a structured log entry containing:

The full prompt sent to the model (system message, user message, conversation history)
The full response received
The model name and version (e.g., claude-sonnet-4-6, gpt-4o)
Token counts: input tokens, output tokens, and cached tokens if applicable
Wall-clock latency from request to response
A timestamp and a unique request ID

This alone gives you cost visibility, a basic audit trail, and the raw material for everything that follows. If you do nothing else, do this.

Step 2: Add run-level tracing.

Individual LLM call logs are disconnected events. Run-level tracing connects them into a coherent story.

Assign a unique trace ID at the start of each agent run — when the user submits their request — and propagate that ID through every LLM call and tool call that happens as part of handling it. When you look up that trace ID later, you see the entire sequence: what the agent was asked, what it thought, what tools it called, what those tools returned, and what the agent ultimately said.

This is the step that transforms debugging from “something went wrong somewhere” to “here is exactly what happened and in what order.”

Step 3: Instrument tool calls explicitly.

Tool calls deserve their own log entries, separate from LLM call logs. For each tool invocation, record:

The tool name
The inputs passed to it (structured, not stringified)
The output returned
Success or failure status, and the error message if it failed
Latency of the tool call itself

This matters because many agent failures are tool failures — the model reasoned correctly but the tool misbehaved, returned unexpected data, or failed silently. Without explicit tool call logging, these failures look identical to reasoning failures.

Step 4: Set cost and latency alerts.

Before your system sees real traffic, define what normal looks like and alert when you deviate from it:

A per-run token budget: if a single agent run consumes more than X tokens, fire an alert. This catches infinite loops and runaway reasoning chains before they become expensive.
A daily spend ceiling: an absolute cap that triggers an alert (or a hard stop) if crossed.
A latency threshold: if the p95 latency for an agent run exceeds your SLA, you want to know.

Setting these thresholds requires some data — which is why steps 1–3 come first. Even rough initial values are better than none. You can tune them as you learn what normal looks like for your system.

Step 5: Start evaluating outputs — even informally.

Metrics tell you whether the system ran. Evaluation tells you whether it ran correctly. These are different questions, and traditional monitoring cannot answer the second one.

Start simple: sample 5–10% of agent outputs and have a human read them. Flag responses that are wrong, incomplete, off-topic, or otherwise problematic. Keep a running log of what you find. After a few weeks, patterns will emerge — specific input types that reliably cause bad outputs, edge cases the agent handles poorly, prompt formulations that confuse the model.

Once you have enough labeled examples of good and bad outputs, you can automate evaluation — using another LLM as a judge, writing rule-based checks for known failure modes, or building a regression suite from past failures. But the informal human review is where you start, and it often catches things no automated system would.

Step 6: Adopt a dedicated observability platform.

Once you have baseline instrumentation in place — LLM logs, run traces, tool call logs — the volume and complexity of the data will outgrow what you can manage with raw log files and dashboards you built yourself.

At this point, integrate one of the purpose-built LLM observability platforms. The right choice depends on your stack:

Langfuse if you want open-source, self-hostable, and framework-agnostic
LangSmith if you are building on LangChain or LangGraph
Arize Phoenix if evaluation depth is your primary concern
Datadog or New Relic if your team already lives in one of those platforms and wants LLM data alongside your existing infrastructure metrics

Whichever you choose, integrate it properly — not as an afterthought. Wire up the SDK, make sure trace IDs are propagating correctly, set up the dashboards you will actually look at, and configure alerts through the platform rather than through ad hoc scripts.

What “done” looks like.

You have reached a solid baseline when you can answer the following questions for any agent run, within minutes, without reading raw log files:

What did the user ask?
What did the agent do, step by step?
What did each tool call receive and return?
How many tokens did the run consume, and what did it cost?
How long did it take?
Did the output look correct?

If you can answer all six, you are in a fundamentally better position than most teams running AI agents in production today.

Conclusion

AI agents introduce a new class of software problem: systems that are powerful, flexible, and deeply opaque. They can fail in ways that traditional monitoring was never designed to detect — silently, semantically, and expensively.

Observability is how you take back control. Not by making agents deterministic — that would defeat the purpose — but by making their behavior visible. When you can see what an agent is doing, why it’s doing it, and whether the outcome was any good, you can debug failures, catch regressions, and improve the system with confidence.

The fundamentals of observability — logs, metrics, traces — still apply. But for agents, they are a starting point, not a destination. The destination is a system where every reasoning step is traceable, every tool call is auditable, every output is evaluated, and every anomaly triggers an alert before a user has to report it.

That level of observability is not built in a day. But it is built incrementally, starting with a single structured log entry for your first LLM call. The question is not whether you can afford to invest in observability. For any agent system running in production, the question is whether you can afford not to.