Agent Blackbox

Python, AI Agents, Agent Reliability, Postmortems, Local-First, Testing

The Challenge

AI agents can edit files, call tools, run tests, hang in subprocesses, hallucinate success, or fail through flaky network/model dependencies. Traditional QA automation catches deterministic UI/API regressions, but agent runs need evidence capture across prompts, commands, files, logs, timing, and side effects.

The Solution

Built a Python CLI that records agent and automation runs locally, streams output live, redacts sensitive data, classifies failures, exports Markdown/HTML postmortems, and turns ambiguous agent behavior into inspectable evidence. The product direction extends proven automation testing practices into AI-agent reliability testing.

System Architecture

Agent Reliability Flight Recorder

Agent Blackbox translates mature automation testing discipline into local-first evidence capture for AI-agent runs.

01 Agent Run

coding agent / CLI: Wrap Codex, Claude Code, Hermes, scripts, or automation commands as subprocesses.

runtime signals: Capture stdout, stderr, timing, exit code, and command metadata.

02 Reliability Layer

sanitizer: Separate raw output from safe redacted output before sharing.

failure taxonomy: Classify exceptions, test failures, tool/network failures, hangs, and unsafe patterns.

03 Testing Evidence

replay timeline: Reconstruct what the agent did and where the run became risky.

postmortem export: Generate Markdown/HTML artifacts that developers can inspect without rerunning blindly.

Automation testing discipline applied to agents
Local-first and redaction-safe
Built for postmortems, not generic dashboards

✓ Active subprocess stream recording
✓ Sanitized raw/safe output separation
✓ Run-level diagnosis and failure classification
✓ Markdown and visual HTML postmortem export
✓ Gateway, cron, Claude Code, and Codex-oriented diagnostics
✓ Local-first privacy posture with no cloud upload
✓ Agent testing roadmap: replay, mutation, eval harnesses, and policy checks

Agent Blackbox: Applying Automation Testing Discipline to AI-Agent Reliability

Agent Blackbox is a local-first flight recorder for AI coding agents. The idea is simple: if an agent can edit files, run commands, call tools, and claim success, then teams need a reliable way to inspect what actually happened.

My automation testing background is the foundation here. Years of stabilizing brittle browser flows, debugging CI failures, designing test frameworks, and turning vague failures into reproducible evidence map directly onto agent reliability.

The Problem

AI-agent runs fail differently from normal scripts.

A coding agent can:

edit the wrong file
skip the verification command
hallucinate that tests passed
hang inside a subprocess
leak sensitive output into logs
over-edit unrelated files
fail because of model, network, or tool instability

Traditional logs show fragments. They rarely explain the incident.

The Testing Insight

Serious automation testing already solved a version of this problem.

Good automation systems need:

repeatable execution
strong assertions
useful failure artifacts
environment and timing visibility
safe redaction
traceable root cause
reports that humans can act on

Agent reliability needs the same discipline, just applied to prompts, tools, terminal commands, files, and model-driven decisions.
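To make "strong assertions" concrete for agents, here is a minimal sketch that checks an agent's success claim against recorded evidence. It assumes the run already holds a list of verification commands with their exit codes. The names `Verification`, `SUCCESS_CLAIM`, and `audit_claim` are hypothetical illustrations, not Agent Blackbox's actual API, and the claim regex is deliberately narrow.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Verification:
    """One recorded verification command (hypothetical schema)."""
    command: str
    exit_code: int


# Assumption: a success claim looks like "tests passed" / "tests are now passing".
SUCCESS_CLAIM = re.compile(
    r"\b(all )?tests? (are )?(now )?pass(ed|ing|es)?\b", re.IGNORECASE
)


def audit_claim(agent_message: str, verifications: List[Verification]) -> str:
    """Compare what the agent said with what the recorded commands show."""
    if not SUCCESS_CLAIM.search(agent_message):
        return "no_claim"
    if not verifications:
        # The agent claimed success but never ran a verification command.
        return "unverified_claim"
    if any(v.exit_code != 0 for v in verifications):
        # At least one verification command failed despite the claim.
        return "contradicted_claim"
    return "supported_claim"
```

The point is the same as in UI or API automation: a claim with no assertion behind it is treated as unverified, not as a pass.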

The Solution

Agent Blackbox records and diagnoses agent runs locally.

It captures command metadata, stdout, stderr, timing, exit codes, raw output, sanitized output, and failure markers. Then it turns that run into a postmortem instead of leaving the developer to manually scrape terminal history.
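The capture step can be sketched with the standard library alone. This is an illustrative outline, not the tool's real implementation: `RunRecord`, `record_run`, and the `timeout_s` parameter are assumed names. It wraps a command as a subprocess, echoes stdout and stderr live while storing them, and records duration, exit code, and whether the run had to be killed as a hang.

```python
import subprocess
import sys
import threading
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RunRecord:
    """Evidence captured for one wrapped command (hypothetical schema)."""
    command: List[str]
    exit_code: Optional[int] = None
    duration_s: float = 0.0
    timed_out: bool = False
    stdout: List[str] = field(default_factory=list)
    stderr: List[str] = field(default_factory=list)


def _pump(stream, sink, echo):
    # Read line by line so output is echoed live while also being stored.
    for line in stream:
        sink.append(line)
        echo.write(line)
        echo.flush()
    stream.close()


def record_run(command, timeout_s=None):
    record = RunRecord(command=list(command))
    started = time.monotonic()
    proc = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    # One reader thread per pipe, so neither stream can fill up and block the child.
    threads = [
        threading.Thread(target=_pump, args=(proc.stdout, record.stdout, sys.stdout)),
        threading.Thread(target=_pump, args=(proc.stderr, record.stderr, sys.stderr)),
    ]
    for t in threads:
        t.start()
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Treat an overrun as a hang: mark it, kill the process, and reap it.
        record.timed_out = True
        proc.kill()
        proc.wait()
    for t in threads:
        t.join()
    record.exit_code = proc.returncode
    record.duration_s = time.monotonic() - started
    return record
```

A call such as `record_run(["pytest", "-q"], timeout_s=300)` would return a record that later stages can sanitize, classify, and export. This sketch does not handle grandchild processes that keep the pipes open after the parent is killed.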

Current capabilities include:

active subprocess stream recording
sanitized output separation
run-level failure classification
Markdown postmortem export
visual HTML report export
Hermes gateway and cron diagnostics
Claude Code and Codex-oriented session analysis
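Run-level failure classification can be approximated with ordered rules over the captured output. The sketch below is an assumption about how such a taxonomy might be encoded. The labels, the patterns, and the `classify` function are hypothetical, and a real classifier would need far more signals than these few regexes.

```python
import re
from typing import Optional

# Hypothetical rules, checked in order; the first match wins.
# More specific categories come before the generic "exception" rule.
RULES = [
    ("unsafe_pattern", re.compile(r"rm -rf /(?:\s|$)|git push .*--force")),
    ("test_failure", re.compile(r"^FAILED |\b\d+ failed\b|AssertionError", re.MULTILINE)),
    ("network_failure", re.compile(
        r"ConnectionError|Connection refused|timed out|rate limit", re.IGNORECASE)),
    ("exception", re.compile(r"Traceback \(most recent call last\):")),
]


def classify(exit_code: Optional[int], timed_out: bool, output: str) -> str:
    """Map one recorded run to a single failure label."""
    if timed_out:
        # A killed run is a hang regardless of what it printed.
        return "hang"
    for label, pattern in RULES:
        if pattern.search(output):
            return label
    if exit_code not in (0, None):
        # Failed, but no known marker appeared in the output.
        return "nonzero_exit"
    return "ok"
```

Rule order is a design choice: a Python traceback that ends in `ConnectionError` is labeled as a network failure because that rule is checked before the generic exception rule.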

Why Local-First Matters

Agent traces may contain prompts, source code, credentials, customer data, stack traces, and internal file paths.

A reliability tool that uploads everything by default creates a new security problem. Agent Blackbox is local-first by design: inspect the run on the machine where it happened, redact what must not leave, and only share sanitized artifacts.
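As a rough illustration of the redaction step, the sketch below keeps the raw text untouched and derives a separate safe copy through pattern substitution. The patterns and the `sanitize` name are assumptions chosen for the example. They cover only a few obvious secret shapes and would not be sufficient on their own.

```python
import re

# Hypothetical redaction rules: (pattern, replacement). Order matters.
PATTERNS = [
    # Assumed API key shape: "sk-" followed by a long token.
    (re.compile(r"sk-[A-Za-z0-9_-]{16,}"), "[REDACTED_API_KEY]"),
    # Bearer tokens in HTTP headers.
    (re.compile(r"\b(bearer)\s+[A-Za-z0-9._~+/-]+=*", re.IGNORECASE),
     r"\1 [REDACTED_TOKEN]"),
    # key=value or key: value pairs with sensitive key names.
    (re.compile(r"\b(password|secret|token|api_key)(\s*[=:]\s*)\S+", re.IGNORECASE),
     r"\1\2[REDACTED]"),
    # User names embedded in home-directory paths.
    (re.compile(r"/(home|Users)/[^/\s]+"), r"/\1/[USER]"),
]


def sanitize(raw: str) -> str:
    """Return a redacted copy of raw; the raw text itself is left unchanged."""
    safe = raw
    for pattern, replacement in PATTERNS:
        safe = pattern.sub(replacement, safe)
    return safe
```

Under this split, the raw output stays on the machine where the run happened, and only the result of `sanitize` goes into shared artifacts.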

Reliability Roadmap

The next layer is agent testing, not just agent observability.

Planned project directions:

deterministic agent failure fixtures
agent behavior mutation testing
risky-action policy checks
replayable timelines
expected-diagnosis snapshots
eval harnesses for real automation engineering tasks

The metric I care about is not dashboard prettiness. It is whether the system catches agent failure modes before humans trust the next run.
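One of the planned directions, risky-action policy checks, could start as a scan over the commands an agent ran. The sketch below is speculative, since this part of the roadmap is not built yet. `POLICY`, `Violation`, and `check_commands` are hypothetical names, and the three rules are examples only.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Violation:
    rule: str
    command: str


# Hypothetical policy: each rule names a risky shell action.
POLICY = [
    ("recursive_delete",
     re.compile(r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b")),
    ("force_push", re.compile(r"\bgit\s+push\b.*(--force|\s-f\b)")),
    ("pipe_to_shell", re.compile(r"\b(curl|wget)\b.*\|\s*(sh|bash)\b")),
]


def check_commands(commands: List[str]) -> List[Violation]:
    """Return every policy rule triggered by the recorded commands."""
    violations = []
    for command in commands:
        for rule, pattern in POLICY:
            if pattern.search(command):
                violations.append(Violation(rule, command))
    return violations
```

Paired with deterministic failure fixtures, a check like this can be tested the same way as any other code: feed it a known-bad command list and assert on the expected violations.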

Outcome

Agent Blackbox turns opaque agent behavior into evidence.

That is the bridge from traditional automation testing to AI-agent reliability: same discipline, new execution surface.
