LLM observability and agent observability are related, but they are not the same thing. LLM observability tells you what happened around a model call: prompt, response, model, token usage, latency, and sometimes cost. Agent observability must explain the action chain that surrounds those model calls.
That difference matters because many agent failures happen outside the model boundary.
The unit changes from a call to a session
A normal LLM app may be one request and one response. An agent session can include planning, retrieval, tool calls, shell commands, browser actions, file edits, retries, and a final summary. If you only trace the model call, you miss the places where the agent touched the world.
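A minimal sketch of that shift, using only the Python standard library. The `Step` and `Session` types, the `kind` labels, and the `touches_world` flag are hypothetical names chosen for illustration, not part of any framework or standard. The point is only that filtering a session down to its model calls drops every step that changed something.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    kind: str                    # hypothetical labels: "model_call", "tool_call", "shell", "file_edit"
    name: str
    touches_world: bool = False  # True when the step changes state outside the model


@dataclass
class Session:
    session_id: str
    steps: list = field(default_factory=list)

    def model_calls(self):
        """What a model-call-only trace would see."""
        return [s for s in self.steps if s.kind == "model_call"]

    def side_effects(self):
        """Steps where the agent changed something outside the model."""
        return [s for s in self.steps if s.touches_world]


session = Session("sess-001", [
    Step("model_call", "plan"),
    Step("tool_call", "search_docs"),
    Step("shell", "pytest -q"),
    Step("file_edit", "src/app.py", touches_world=True),
    Step("tool_call", "send_email", touches_world=True),
    Step("model_call", "final_summary"),
])

print(len(session.model_calls()), "of", len(session.steps), "steps are model calls")
print("side effects:", [s.name for s in session.side_effects()])
```

In this made-up session, a trace limited to model calls would show the plan and the summary and nothing about the file edit or the email.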
OpenTelemetry’s GenAI work exists because the industry needs shared conventions for model calls, tools, agent steps, traces, metrics, and logs. Vendor-neutral telemetry matters because otherwise every framework invents its own event shape, and teams get locked into dashboards instead of owning their evidence.
What LLM observability captures well
- Model name and provider.
- Prompt and response metadata.
- Token usage, latency, and cost.
- Error rates and provider failures.
- Some safety or quality scores.
That is useful. It is not enough.
What agent observability must add
- Tool selection and tool arguments.
- Tool outputs and exceptions.
- Files read, files changed, and diffs.
- External state changes: tickets, database rows, calendar events, messages.
- Approval prompts and denied actions.
- Step count, loops, retries, and stop condition.
- Final claim mapped to outcome evidence.
A production incident rarely says, “token count was high.” It says, “the agent emailed the wrong customer,” “the agent skipped the failing test,” or “the agent used stale context and updated the wrong record.” Those are agent-level failures.
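One way to capture several of the additions listed above (tool name, arguments, output, exception, approval decision, step number, stop condition) is to route every tool call through a small recorder. This is a sketch under stated assumptions: `ToolEvent`, `SessionRecorder`, and the `update_record` tool are hypothetical, and a real agent framework will have its own hook points.

```python
import time
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ToolEvent:
    step: int
    tool: str
    arguments: dict
    approved: bool = True
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


class SessionRecorder:
    """Hypothetical recorder: wraps tool calls and keeps one event per call."""

    def __init__(self):
        self.events = []
        self.stop_condition = None  # e.g. "done", "max_steps", "user_abort"

    def call_tool(self, tool, fn, approved=True, **arguments):
        event = ToolEvent(step=len(self.events) + 1, tool=tool,
                          arguments=arguments, approved=approved)
        self.events.append(event)
        if not approved:
            # Denied actions are recorded too; the tool is never run.
            event.error = "denied by approval gate"
            return None
        start = time.perf_counter()
        try:
            event.output = fn(**arguments)
            return event.output
        except Exception as exc:
            # Keep the exception as evidence instead of losing it in a retry.
            event.error = f"{type(exc).__name__}: {exc}"
            return None
        finally:
            event.duration_s = time.perf_counter() - start


def update_record(record_id, status):
    # Stand-in tool: only record "R-1" exists in this example.
    if record_id != "R-1":
        raise KeyError(record_id)
    return {"record_id": record_id, "status": status}


rec = SessionRecorder()
rec.call_tool("update_record", update_record, record_id="R-1", status="closed")
rec.call_tool("update_record", update_record, record_id="R-9", status="closed")
rec.call_tool("send_email", lambda to: "sent", approved=False, to="customer@example.com")
rec.stop_condition = "max_steps"

for e in rec.events:
    print(e.step, e.tool, e.arguments, "->", e.error or e.output)
```

With events like these, "the agent updated the wrong record" becomes a question you can answer from the arguments of a specific step rather than from the model transcript.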
Observability must feed evaluation
OpenTelemetry notes that telemetry for non-deterministic agents is not only for troubleshooting. It becomes feedback for evaluation. That is the right mental model. Traces should become datasets. Failure modes should become regression evals. Postmortems should become harness improvements.
If your observability stops at charts, it is passive. If it feeds tests, it becomes a reliability loop.
The local-first wrinkle
Agent traces can contain source code, file paths, prompts, credentials, customer records, and command output. A cloud observability platform may be appropriate for some teams, but local-first capture is safer as the default for personal agents and codebase-level work. Capture raw evidence locally. Redact before sharing.
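A minimal sketch of "capture raw locally, redact before sharing", assuming events are dictionaries with a string `output` field. The three patterns (key-like assignments, email addresses, home-directory user names) are illustrative only. Pattern-based redaction misses things, so treat this as a starting point rather than a guarantee.

```python
import re

# Illustrative patterns only; a real deployment needs a reviewed, broader set.
PATTERNS = [
    (re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[=:]\s*\S+"), r"\1=[REDACTED]"),
    (re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "[EMAIL]"),
    (re.compile(r"/(home|Users)/[^/\s]+"), r"/\1/[USER]"),
]


def redact(text):
    """Apply each pattern in order and return the scrubbed text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def export_for_sharing(events):
    """Return redacted copies; the raw events stay untouched on local disk."""
    return [{**e, "output": redact(e["output"])} for e in events]


raw_events = [
    {"step": 1, "tool": "shell",
     "output": "API_KEY=sk-123abc sent to jane@example.com from /Users/jane/proj/app.py"},
]
shared = export_for_sharing(raw_events)

print(shared[0]["output"])
```

Because `export_for_sharing` builds new dictionaries, the raw record remains available locally for debugging while only the scrubbed copy leaves the machine.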
That is the bridge between observability and trust: not more dashboards, but evidence that is complete enough to debug and safe enough to review.
Sources and further reading
- OpenTelemetry, AI Agent Observability
- LangChain, Agent Evaluation Readiness Checklist
- Dhiraj Das, Agent Blackbox Visual Flight Recorder

