Automation
AI
Test Automation
Agent Observability vs LLM Observability: What Actually Matters

Agent Observability vs LLM Observability: What Actually Matters

3 min read

LLM observability and agent observability are related, but they are not the same thing. LLM observability tells you what happened around a model call: prompt, response, model, token usage, latency, maybe cost. Agent observability must explain the action chain that surrounds those model calls.

That difference matters because agents fail outside the model boundary.

The unit changes from a call to a session

A normal LLM app may be one request and one response. An agent session can include planning, retrieval, tool calls, shell commands, browser actions, file edits, retries, and a final summary. If you only trace the model call, you miss the places where the agent touched the world.

OpenTelemetry’s GenAI work exists because the industry needs shared conventions for model calls, tools, agent steps, traces, metrics, and logs. Vendor-neutral telemetry matters because otherwise every framework invents its own event shape, and teams get locked into dashboards instead of owning their evidence.

What LLM observability captures well

  • Model name and provider.
  • Prompt and response metadata.
  • Token usage, latency, and cost.
  • Error rates and provider failures.
  • Some safety or quality scores.

That is useful. It is not enough.

What agent observability must add

  • Tool selection and tool arguments.
  • Tool outputs and exceptions.
  • Files read, files changed, and diffs.
  • External state changes: tickets, database rows, calendar events, messages.
  • Approval prompts and denied actions.
  • Step count, loops, retries, and stop condition.
  • Final claim mapped to outcome evidence.

A production incident rarely says, “token count was high.” It says, “the agent emailed the wrong customer,” “the agent skipped the failing test,” or “the agent used stale context and updated the wrong record.” Those are agent-level failures.

Observability must feed evaluation

OpenTelemetry notes that telemetry for non-deterministic agents is not only for troubleshooting. It becomes feedback for evaluation. That is the right mental model. Traces should become datasets. Failure modes should become regression evals. Postmortems should become harness improvements.

If your observability stops at charts, it is passive. If it feeds tests, it becomes a reliability loop.

Practical split
Use LLM observability to understand model behavior. Use agent observability to understand work performed under model control.

The local-first wrinkle

Agent traces can contain source code, file paths, prompts, credentials, customer records, and command output. A cloud observability platform may be appropriate for some teams, but local-first capture is safer as the default for personal agents and codebase-level work. Capture raw evidence locally. Redact before sharing.

That is the bridge between observability and trust: not more dashboards, but evidence that is complete enough to debug and safe enough to review.

Sources and further reading

Dhiraj Das

About the Author

Dhiraj Das | Automation Consultant | 10+ years building automation systems that expose failures, reduce flakiness, and make complex workflows repeatable. He now applies that discipline independently to AI-agent validation, run replay, LLM testing, and postmortems.

Creator of many open-source tools solving what traditional automation can't: waitless (flaky tests), sb-stealth-wrapper (bot detection), selenium-teleport (state persistence), selenium-chatbot-test (AI chatbot testing), lumos-shadowdom (Shadow DOM), and visual-guard (visual regression).

Share this article: