From Passive Log-Reading to Active Stream-Tapping: Building a Local Flight Recorder for AI Agents

June 21, 2026 3 min read

AI coding agents are fundamentally non-deterministic. When you run an agent like Claude Code, Cursor, or Hermes, standard application logs are where diagnostic context goes to die. A traditional software crash prints a stack trace, exits, and points directly to the line of failure. An AI agent, on the other hand, can fail silently: it might happily report a `Success` code while caught in an infinite tool-use loop, hallucinating an API call, or dropping critical error details.

To debug a failed agent run, you don't just want a post-facto summary. You need a flight recorder. You need to reconstruct the precise terminal state, command-line arguments, environment parameters, and stream outputs step-by-step.

Most developers attempt this by parsing logs retrospectively (passive log-reading). But passive parsing is a brittle post-mortem strategy. It fails when processes crash abruptly, log-writing buffers delay stdout, or logs contain unredacted production secrets.

To solve this, we upgraded Agent Blackbox from a retrospective gateway log parser to an active subprocess stream-tapping wrapper.

Passive Log-Reading vs. Active Stream-Tapping

Feature	Passive Log-Reading (Old Way)	Active Stream-Tapping (The Breakthrough)
Execution Control	Retrospective analysis of pre-written log files.	Direct subprocess lifecycle management via wrapper execution.
Stream Interception	Wait for files to flush to disk; high risk of missing terminal panics.	Dual-threaded stdout/stderr redirection in real-time.
Data Fidelity	Dependent on application-specific log formats.	Exact, line-by-line terminal mirror with exit codes and execution latency.
Credential Safety	Sensitive tokens are written to disk before redaction.	In-memory token sanitization prior to writing the JSON record.
Diagnostic Portability	Custom parse scripts for different gateway targets.	Structured JSON flight records ready for replay or programmatic post-mortems.

Inside the Tapping Architecture: Thread-Safe Dual-Stream Redirection

The core breakthrough of the active recorder lies in its execution wrapper (`agent-doctor record`). Instead of launching an agent shell directly, you run it through the Blackbox:

Code

agent-doctor record --name "test-agent-run" -- python run_agent.py
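
The wrapper has to separate its own options from the command it wraps, which is what the `--` is for. As a rough sketch, assuming the CLI is built on `argparse`, the split could look like the following. The `--out-dir` flag, its `runs` default, and the `build_parser` and `strip_separator` names are illustrative guesses, not confirmed details of Agent Blackbox.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for an `agent-doctor record` style command."""
    parser = argparse.ArgumentParser(prog="agent-doctor")
    subparsers = parser.add_subparsers(dest="command", required=True)

    record = subparsers.add_parser("record", help="run a command and record its output")
    record.add_argument("--name", default=None, help="label stored in the flight record")
    # Assumed flag and default: the article only implies that an output directory exists.
    record.add_argument("--out-dir", default="runs", help="directory for JSON records")
    # REMAINDER collects everything left over, including flags meant for the wrapped command.
    record.add_argument("cmd_args", nargs=argparse.REMAINDER)
    return parser


def strip_separator(cmd_args: list[str]) -> list[str]:
    """Drop a leading `--` if argparse left it in the remainder."""
    if cmd_args and cmd_args[0] == "--":
        return cmd_args[1:]
    return list(cmd_args)


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["record", "--name", "test-agent-run", "--", "python", "run_agent.py"]
    )
    print(args.name, strip_separator(args.cmd_args))
```

Wrapper options go before the `--`. Everything after it is passed to the wrapped command untouched, so a flag such as `--prod` on the agent script is never mistaken for a recorder option.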

Under the hood, Agent Blackbox uses Python's `subprocess.Popen` to launch the agent as a child process with `stdout` and `stderr` piped back to the wrapper, then starts one background thread per pipe. Reading both pipes concurrently keeps either one from filling up and stalling the child, and each line is mirrored to the terminal as soon as the wrapper receives it.

Thread-Safe In-Memory Accumulation

To capture stdout and stderr concurrently without corrupting either stream, each worker thread reads its pipe line by line with `readline`, echoes the line to the terminal (for live user feedback), and appends it to a shared list guarded by a lock. The two streams end up interleaved in the order their lines arrive.

Here is the simplified Python implementation running inside the active flight recorder:

Code

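import argparse
import datetime as dt
import json
import os
import subprocess
import sys
import threading
import uuid
from pathlib import Path


def cmd_record(args: argparse.Namespace) -> int:
    """Run the wrapped command, mirror its output live, and save a JSON flight record."""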
    cmd_args = list(args.cmd_args or [])
    if cmd_args and cmd_args[0] == "--":
        cmd_args = cmd_args[1:]
    if not cmd_args:
        print("record requires a command to run", file=sys.stderr)
        return 2

    start = dt.datetime.now(dt.timezone.utc)
    captured: list[str] = []
    captured_lock = threading.Lock()

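    # POSIX: pass the argument list straight to Popen, with no shell involved.
    popen_cmd: list[str] | str = cmd_args
    use_shell = False
    if os.name == "nt":
        # Windows: join the arguments into one quoted command string and run it through the shell.
        popen_cmd = subprocess.list2cmdline(cmd_args)
        use_shell = True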

    proc = subprocess.Popen(
        popen_cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
        encoding="utf-8",
        errors="replace",
        bufsize=1,
        shell=use_shell,
    )

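    def stream_output(stream, terminal) -> None:
        """Mirror each line to the terminal as it arrives and append it to the shared buffer."""
        try:
            # readline returns "" only at end of file, which ends the loop.
            for line in iter(stream.readline, ""):
                terminal.write(line)
                terminal.flush()
                with captured_lock:
                    captured.append(line)
        finally:
            stream.close()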

    threads = [
        threading.Thread(target=stream_output, args=(proc.stdout, sys.stdout), daemon=True),
        threading.Thread(target=stream_output, args=(proc.stderr, sys.stderr), daemon=True),
    ]
    for thread in threads:
        thread.start()

    # Wait for the child to exit, then join the readers so both pipes are fully drained.
    exit_code = proc.wait()
    for thread in threads:
        thread.join()

    end = dt.datetime.now(dt.timezone.utc)
    duration_seconds = (end - start).total_seconds()
    raw_output = "".join(captured)
    # sanitize() is the regex redaction helper; it is not defined in this excerpt.
    sanitized_output = sanitize(raw_output)
    run_id = f"{start.strftime('%Y%m%dT%H%M%S')}_{uuid.uuid4().hex[:8]}"
    out_dir = Path(args.out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"run_{run_id}.json"

    record = {
        "id": run_id,
        "name": args.name or " ".join(cmd_args),
        "command": cmd_args,
        "start_time": start.isoformat(),
        "end_time": end.isoformat(),
        "duration_seconds": duration_seconds,
        "exit_code": exit_code,
        # Unredacted copy, stored next to the sanitized one: protect or drop it before sharing.
        "raw_output": raw_output,
        "sanitized_output": sanitized_output,
    }
    out_path.write_text(json.dumps(record, indent=2), encoding="utf-8")

    print(f"[FLIGHT RECORDER] Command finished with exit code {exit_code} in {duration_seconds:.2f}s.")
    print(f"[FLIGHT RECORDER] Recording saved to: {out_path}")
    return exit_code

In-Flight Credential Sanitization

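Observability cannot come at the expense of security. Because agents regularly manipulate environment variables, call LLM providers, and print bearer tokens, the stream-tapping wrapper redacts the captured output in memory before the sanitized copy is written to disk. Note that the simplified `cmd_record` shown above also stores the unredacted `raw_output` in the same record, so only the `sanitized_output` field is safe to share.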

We run a dedicated sanitization pipeline applying regex filters to redact sensitive patterns:

API Keys / Secrets: Filters matching `(?i)(api[_-]?key|token|secret|password|authorization)[:=]\s*[^\s,;]+`
Bearer Credentials: Neutralizing HTTP Auth headers matching `(?i)bearer\s+[a-z0-9._\-]+`
Private Cryptographic Keys: Catching blocks matching `-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----`
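
The three filters above can be sketched as one function. This is a minimal sketch under assumptions: the article does not show `sanitize` itself, so the application order, the whole-match replacement, and the `re.DOTALL` flag (needed for the private-key pattern to span multiple lines) are guesses. Only the patterns and the `[REDACTED_SECRET]` marker come from the text.

```python
import re

REDACTED = "[REDACTED_SECRET]"

# Patterns copied from the list above; the compile flags are an assumption.
PRIVATE_KEY = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,  # lets ".*?" cross the newlines inside a PEM block
)
BEARER = re.compile(r"(?i)bearer\s+[a-z0-9._\-]+")
KEY_VALUE = re.compile(
    r"(?i)(api[_-]?key|token|secret|password|authorization)[:=]\s*[^\s,;]+"
)


def sanitize(text: str) -> str:
    """Replace every match of the three patterns with the redaction marker."""
    text = PRIVATE_KEY.sub(REDACTED, text)
    # Bearer tokens go first: the key/value pattern stops at the first whitespace,
    # so on its own it would leave the token after "Bearer " in place.
    text = BEARER.sub(REDACTED, text)
    text = KEY_VALUE.sub(REDACTED, text)
    return text


if __name__ == "__main__":
    print(sanitize("api_key=sk-12345, retrying"))
    print(sanitize("Authorization: Bearer abc.def-123"))
```

Order matters in this sketch. Running the bearer filter before the key/value filter keeps a header such as `Authorization: Bearer <token>` from being only half redacted. The sample record in this article keeps more of the surrounding text than this sketch does, so the real replacement logic likely differs.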

The stream-tapping pipeline produces two outputs:

Live Terminal View: Unaltered output, so the local developer sees exactly what the agent printed.
JSON Flight Record: A snapshot whose `sanitized_output` field has sensitive elements replaced by `[REDACTED_SECRET]`.

The Structured Flight Record JSON Ledger

Upon completion, the recorder bundles metadata and output into a portable JSON record. Because every run is written with the same fields, downstream diagnostic tools can parse any record without format-specific scripts, which makes post-mortems quick to automate.

Code

{
  "id": "20260621T182415_a7f92b1d",
  "name": "agent-flight-deployment",
  "command": ["python", "deploy_agent.py", "--prod"],
  "start_time": "2026-06-21T18:24:15.102482+00:00",
  "end_time": "2026-06-21T18:24:22.459102+00:00",
  "duration_seconds": 7.35662,
  "exit_code": 1,
  "raw_output": "...",
  "sanitized_output": "[FLIGHT INFO] Deploying agent...\n[ERROR] Connection failed: authorization=Bearer [REDACTED_SECRET]"
}
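
Because every record shares the same fields, a post-mortem script can stay small. The sketch below is one possible reader, not part of Agent Blackbox: the `summarize_record` and `failed_runs` names and the repeated-line heuristic are assumptions. The field names and the `run_*.json` file pattern come from the recorder code above.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_record(path: Path) -> dict:
    """Load one flight record and pull out the fields a triage script usually wants."""
    record = json.loads(path.read_text(encoding="utf-8"))
    lines = record.get("sanitized_output", "").splitlines()
    # A line repeated many times is a cheap hint that the agent was stuck in a loop.
    repeated = Counter(line for line in lines if line.strip()).most_common(1)
    return {
        "id": record["id"],
        "failed": record["exit_code"] != 0,
        "duration_seconds": record["duration_seconds"],
        "last_line": lines[-1] if lines else "",
        "most_repeated": repeated[0] if repeated else None,
    }


def failed_runs(out_dir: Path) -> list[dict]:
    """Summarize every run_*.json record in out_dir that exited non-zero."""
    summaries = (summarize_record(p) for p in sorted(out_dir.glob("run_*.json")))
    return [s for s in summaries if s["failed"]]
```

An exit code of zero does not prove the run was healthy, so the `most_repeated` field is worth checking on successful runs too.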

The Automation Architect’s Verdict

To run high-velocity, human-out-of-the-loop automation, you cannot fly blind. If you do not have a dedicated, local-first flight recorder, you cannot debug flaky agent logic without wasting thousands of tokens re-running broken states.

Active subprocess stream-tapping with in-memory sanitization changes the equation. It cannot make the agent itself deterministic, but it turns each non-deterministic run into a fixed, inspectable record with credentials redacted.

From Passive Logs to Active Diagnostics

Are you still relying on passive logs to debug your agents? It's time to install a flight recorder.

About the Author

Dhiraj Das | Senior Automation Consultant | 10+ years building test automation that actually works. He transforms flaky, slow regression suites into reliable CI pipelines—designing self-healing frameworks that don't just run tests, but understand them.

Creator of many open-source tools solving what traditional automation can't: waitless (flaky tests), sb-stealth-wrapper (bot detection), selenium-teleport (state persistence), selenium-chatbot-test (AI chatbot testing), lumos-shadowdom (Shadow DOM), and visual-guard (visual regression).
