How to Build a Local-First AI Agent Flight Recorder

A flight recorder exists because complex incidents cannot be reconstructed from memory. AI-agent runs have the same problem. The important evidence is scattered across prompts, model responses, tool calls, shell output, files, approvals, retries, and final summaries.

If you do not capture the run while it happens, the postmortem becomes archaeology.

Start with an append-only event stream

The simplest useful format is JSONL: one event per line, ordered by time. Each event should carry a run id, a timestamp, an event type, and a payload; the sample events that follow are abbreviated and omit some of these fields. You do not need a cloud platform to start. You need durable local evidence.

{"type":"command.start","cmd":"npm run build","cwd":"repo","time":"..."}
{"type":"command.end","exit_code":0,"stdout_tail":"built in 4.15s"}
{"type":"file.diff","path":"src/components/SEO.jsx","summary":"title suffix support"}
{"type":"claim","text":"Build passed and blog pages prerendered"}
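Writing events like these takes very little code. The following is a minimal sketch, not the Agent Blackbox implementation: the `Recorder` class, the `read_events` helper, and the choice to nest event-specific fields under a `payload` key are all assumptions made for illustration.

```python
import json
import os
import time
import uuid
from pathlib import Path


class Recorder:
    """Append-only JSONL recorder (hypothetical name and event shape)."""

    def __init__(self, path, run_id=None):
        self.path = Path(path)
        self.run_id = run_id or uuid.uuid4().hex
        self.path.parent.mkdir(parents=True, exist_ok=True)

    def record(self, event_type, **payload):
        event = {
            "run_id": self.run_id,
            "time": time.time(),  # seconds since the Unix epoch
            "type": event_type,
            "payload": payload,
        }
        # Mode "a" only ever appends; earlier events are never rewritten.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to persist the line now
        return event


def read_events(path):
    """Load every non-empty line of a JSONL file as one event dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Calling `Recorder("runs/run.jsonl").record("command.end", exit_code=0)` appends one line. Syncing after every event trades some speed for durability, which is usually the right trade when the log is the evidence.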

Capture the action layer

Model calls matter, but the action layer is where trust is won or lost. Capture commands, tool inputs, tool outputs, exit codes, changed files, generated artifacts, and approvals. For browser or desktop agents, capture screenshots or accessibility-tree snapshots at important steps.
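For shell commands, capture can be a thin wrapper around the process call. This is a rough sketch under stated assumptions: `run_recorded`, the `emit` callback, and the extra fields (`stderr_tail`, `duration_s`) are hypothetical names, not part of any existing tool.

```python
import subprocess
import sys
import time


def run_recorded(cmd, emit, cwd=None, tail_chars=2000):
    """Run a command and emit start/end events.

    `emit` is any callable that accepts one event dict (hypothetical
    convention); `cmd` is an argument list as accepted by subprocess.run.
    """
    emit({"type": "command.start", "cmd": cmd, "cwd": cwd, "time": time.time()})
    started = time.monotonic()
    proc = subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)
    emit({
        "type": "command.end",
        "cmd": cmd,
        "exit_code": proc.returncode,
        # Keep only the tail so a noisy build cannot bloat the log.
        "stdout_tail": proc.stdout[-tail_chars:],
        "stderr_tail": proc.stderr[-tail_chars:],
        "duration_s": round(time.monotonic() - started, 3),
        "time": time.time(),
    })
    return proc


# Usage: collect events in a list; a real recorder would append to JSONL.
events = []
run_recorded([sys.executable, "-c", "print('built')"], events.append)
```

Passing `emit` as a callback keeps the wrapper independent of where events end up, so the same function works with an in-memory list in tests and a file-backed recorder in real runs.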

OpenTelemetry’s GenAI semantic conventions give the industry a path toward standardized traces for model and agent operations. A local recorder can follow the same spirit, with consistent event names and attributes, even before adopting a full OTel pipeline.

Redaction is not optional

Agent traces may contain source code, credentials, customer data, local paths, private prompts, and environment variables. Raw traces should stay local by default. Exported reports should be sanitized: mask secrets, normalize paths, strip irrelevant noise, and make the remaining evidence safe to share.

This is why local-first matters. Upload-by-default observability can become a new security incident.
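A sanitizing pass over exported events might look like the sketch below. The patterns are deliberately small and illustrative; they are assumptions, not a complete secret detector, and a real export step would need a broader, tested set. `redact_text` and `redact_event` are hypothetical names.

```python
import re
from pathlib import Path

# Illustrative patterns only (assumption): key=value style secrets and
# strings shaped like AWS access key ids.
KEY_VALUE_SECRET = re.compile(
    r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[=:]\s*)(\S+)"
)
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def redact_text(text, home=None):
    """Normalize the home directory to "~" and mask secret-looking values."""
    home = home or str(Path.home())
    text = text.replace(home, "~")
    text = KEY_VALUE_SECRET.sub(
        lambda m: f"{m.group(1)}{m.group(2)}[REDACTED]", text
    )
    return AWS_ACCESS_KEY_ID.sub("[REDACTED]", text)


def redact_event(value, home=None):
    """Recursively redact every string inside nested dicts and lists."""
    if isinstance(value, str):
        return redact_text(value, home)
    if isinstance(value, dict):
        return {k: redact_event(v, home) for k, v in value.items()}
    if isinstance(value, list):
        return [redact_event(v, home) for v in value]
    return value  # numbers, booleans, and None pass through unchanged
```

The function returns a new structure rather than mutating the input, so the raw trace stays intact on disk while only the sanitized copy is exported.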

Turn logs into timelines

Raw logs are painful. The recorder should render Markdown and HTML reports with sections humans can scan: summary, timeline, files changed, commands run, failures, redactions, and next actions. The visual timeline is not cosmetic; it reduces the cognitive load of understanding long agent loops.
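A Markdown renderer for a subset of those sections can be sketched in a few lines. It assumes flat event dicts like the sample events earlier, with an optional numeric `time` field in epoch seconds; `render_markdown` and the exact report layout are hypothetical.

```python
from datetime import datetime, timezone


def render_markdown(events):
    """Render a list of event dicts as a scannable Markdown report."""
    commands = [e for e in events if e.get("type") == "command.end"]
    failures = [e for e in commands if e.get("exit_code") != 0]
    files = sorted(
        {e["path"] for e in events if e.get("type") == "file.diff" and "path" in e}
    )

    lines = ["# Run report", "", "## Summary", ""]
    lines.append(
        f"- {len(events)} events, {len(commands)} commands, "
        f"{len(failures)} failures, {len(files)} files changed"
    )

    lines += ["", "## Timeline", ""]
    for e in events:
        ts = e.get("time")
        if isinstance(ts, (int, float)):
            stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%H:%M:%S")
        else:
            stamp = "--:--:--"  # event carried no usable timestamp
        detail = e.get("cmd") or e.get("path") or e.get("text") or ""
        lines.append(f"- `{stamp}` **{e.get('type', 'unknown')}** {detail}".rstrip())

    lines += ["", "## Files changed", ""]
    lines += [f"- {p}" for p in files] or ["- none"]

    lines += ["", "## Failures", ""]
    lines += [
        f"- `{e.get('cmd', '?')}` exited {e.get('exit_code')}" for e in failures
    ] or ["- none"]
    return "\n".join(lines) + "\n"
```

Putting the summary counts first lets a reader decide in one line whether the rest of the report needs attention.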

The useful end state

The goal is not merely “record everything.” The goal is claim verification. When the agent says “done,” the flight recorder should show the evidence. When the run fails, it should produce a postmortem that can become a regression eval.

That is the product direction behind Agent Blackbox: local-first evidence capture for agentic work.
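Claim verification can start as a simple pass over the event stream: for each claim, look only at the events recorded before it and ask whether they support it. The sketch below is one possible shape, not how Agent Blackbox does it. `verify_claims`, the `checks` mapping, and the assumption that `command.end` events carry a `cmd` field are all hypothetical.

```python
def verify_claims(events, checks):
    """Pair each claim event with a verdict from caller-supplied checks.

    `checks` maps a phrase to a predicate over the events that precede
    the claim. A claim containing the phrase must satisfy the predicate.
    """
    results = []
    for i, event in enumerate(events):
        if event.get("type") != "claim":
            continue
        prior = events[:i]  # evidence must exist before the claim is made
        text = event.get("text", "")
        matched = [
            check for phrase, check in checks.items()
            if phrase.lower() in text.lower()
        ]
        if not matched:
            verdict = "unchecked"  # no rule covers this claim
        elif all(check(prior) for check in matched):
            verdict = "supported"
        else:
            verdict = "unsupported"
        results.append({"claim": text, "verdict": verdict})
    return results


def last_build_passed(prior):
    """True if the most recent build command exited with code 0."""
    ends = [
        e for e in prior
        if e.get("type") == "command.end" and "build" in e.get("cmd", "")
    ]
    return bool(ends) and ends[-1].get("exit_code") == 0
```

With `checks = {"build passed": last_build_passed}`, a claim such as "Build passed and blog pages prerendered" is marked supported only when the latest recorded build exited cleanly. An unsupported claim is a natural seed for a regression eval.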

Sources and further reading

OpenTelemetry, AI Agent Observability
Dhiraj Das, From Passive Log-Reading to Active Stream-Tapping
Dhiraj Das, Visual Flight Recorder

About the Author

Dhiraj Das | Automation Consultant | 10+ years building automation systems that expose failures, reduce flakiness, and make complex workflows repeatable. He now applies that discipline independently to AI-agent validation, run replay, LLM testing, and postmortems.

Creator of many open-source tools solving what traditional automation can't: waitless (flaky tests), sb-stealth-wrapper (bot detection), selenium-teleport (state persistence), selenium-chatbot-test (AI chatbot testing), lumos-shadowdom (Shadow DOM), and visual-guard (visual regression).
