A flight recorder exists because complex incidents cannot be reconstructed from memory. AI-agent runs have the same problem. The important evidence is scattered across prompts, model responses, tool calls, shell output, files, approvals, retries, and final summaries.
If you do not capture the run while it happens, the postmortem becomes archaeology.
Start with an append-only event stream
The simplest useful format is JSONL: one event per line, ordered by time. Each event should have a run id, timestamp, event type, and payload. You do not need a cloud platform to start. You need durable local evidence.
{"type":"command.start","cmd":"npm run build","cwd":"repo","time":"..."}
{"type":"command.end","exit_code":0,"stdout_tail":"built in 4.15s"}
{"type":"file.diff","path":"src/components/SEO.jsx","summary":"title suffix support"}
{"type":"claim","text":"Build passed and blog pages prerendered"}Capture the action layer
Model calls matter, but the action layer is where trust is won or lost. Capture commands, tool inputs, tool outputs, exit codes, changed files, generated artifacts, and approvals. For browser or desktop agents, capture screenshots or accessibility-tree snapshots at important steps.
OpenTelemetry’s GenAI work gives the industry a path toward standardized traces. A local recorder can follow the same spirit even before adopting a full OTel pipeline.
Redaction is not optional
Agent traces may contain source code, credentials, customer data, local paths, private prompts, and environment variables. Raw traces should stay local by default. Exported reports should be sanitized: mask secrets, normalize paths, strip irrelevant noise, and make the remaining evidence safe to share.
This is why local-first matters. Upload-by-default observability can become a new security incident.
Turn logs into timelines
Raw logs are painful. The recorder should render Markdown and HTML reports with sections humans can scan: summary, timeline, files changed, commands run, failures, redactions, and next actions. The visual timeline is not cosmetic; it reduces the cognitive load of understanding long agent loops.
The useful end state
The goal is not merely “record everything.” The goal is claim verification. When the agent says “done,” the flight recorder should show the evidence. When the run fails, it should produce a postmortem that can become a regression eval.
That is the product direction behind Agent Blackbox: local-first evidence capture for agentic work.
Sources and further reading
- OpenTelemetry, AI Agent Observability
- Dhiraj Das, From Passive Log-Reading to Active Stream-Tapping
- Dhiraj Das, Visual Flight Recorder

