The AI Agent Postmortem Template I Use

Most AI-agent failures are described badly: “it got confused,” “it hallucinated,” “it ignored the prompt.” Those are symptoms, not postmortems. A useful postmortem turns the run into a timeline, the timeline into a cause, and the cause into a harness improvement.

The point is not to blame the model. The point is to stop paying for the same failure twice.

The template

Use this structure after any serious agent failure or false-success run.

Goal: what the agent was asked to do.
Autonomy level: read-only, tool-using, code-editing, external-action, or scheduled.
Environment: branch, commit, working directory, tool permissions, key dependencies.
Timeline: prompts, decisions, tool calls, commands, outputs, file changes, approvals.
Expected outcome: what should have changed.
Actual outcome: what changed in the real environment.
Evidence: diffs, logs, screenshots, database queries, test output, generated artifacts.
Root cause class: model, prompt, tool, environment, policy, verification, reporting, or human handoff.
Blast radius: what could have been affected.
Redaction: what sensitive material was removed before sharing.
Prevention: new eval, new guardrail, new fixture, new approval gate, or better report.
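
One way to keep the template from drifting into free-form prose is to capture it as a typed record. The sketch below is a minimal illustration in Python. The class name `AgentPostmortem`, its field names, and the `missing_fields` helper are hypothetical, chosen only to mirror the list above.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class AgentPostmortem:
    """Hypothetical record mirroring the template fields above."""
    goal: str = ""
    autonomy_level: str = ""      # e.g. "read-only", "tool-using", "code-editing"
    environment: str = ""         # branch, commit, working directory, permissions
    timeline: List[str] = field(default_factory=list)
    expected_outcome: str = ""
    actual_outcome: str = ""
    evidence: List[str] = field(default_factory=list)  # paths or links to artifacts
    root_cause_class: str = ""
    blast_radius: str = ""
    redaction: str = ""
    prevention: str = ""


def missing_fields(pm: AgentPostmortem) -> List[str]:
    """Return the names of template fields that are still empty."""
    return [f.name for f in fields(pm) if not getattr(pm, f.name)]
```

Calling `missing_fields` before a postmortem is shared gives a quick list of the sections nobody has filled in yet, which is usually where the evidence or the prevention step is missing.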

Why final summaries are not enough

An agent's final message is a witness statement, not the evidence. It may be accurate, incomplete, stale, or fabricated. Anthropic’s eval vocabulary separates transcript and outcome for exactly this reason. The transcript records what the agent did and said. The outcome is the final state of the environment, and only that shows whether the task succeeded.

For coding agents, the outcome is often executable: tests, typecheck, build, diff, deployed preview. For business agents, the outcome is external state: a ticket, booking, message, or record. The postmortem must inspect that state directly.
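
For a coding agent, inspecting the state directly can be as simple as re-running the check yourself and comparing the exit code with what the agent claimed. The following is a minimal sketch under that assumption. `OutcomeCheck` and `verify_outcome` are hypothetical names, and the failing command is simulated with a one-line Python process rather than a real test suite.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List


@dataclass
class OutcomeCheck:
    claimed_success: bool   # what the agent's final message said
    exit_code: int          # what the independent check returned
    output: str             # combined stdout and stderr, kept as evidence

    @property
    def false_success(self) -> bool:
        """True when the agent reported success but the check failed."""
        return self.claimed_success and self.exit_code != 0


def verify_outcome(claimed_success: bool, check_cmd: List[str],
                   timeout: float = 300.0) -> OutcomeCheck:
    """Re-run the check independently instead of trusting the summary."""
    proc = subprocess.run(check_cmd, capture_output=True, text=True, timeout=timeout)
    return OutcomeCheck(claimed_success, proc.returncode, proc.stdout + proc.stderr)


# Simulated failing check; in practice this would be a test, typecheck, or build command.
result = verify_outcome(True, [sys.executable, "-c", "import sys; sys.exit(1)"])
print(result.false_success)  # True
```

For business agents the same idea applies, but the check is a query against the ticket system, calendar, or database rather than a local command.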

A simple failure taxonomy

Model reasoning failure: wrong plan, wrong assumption, missed constraint.
Prompt/spec failure: success criteria were ambiguous.
Tool failure: tool returned bad data, timed out, or hid an error.
Environment failure: dependencies, credentials, network, branch, test data.
Policy failure: action should have required approval but did not.
Verification failure: agent did not run the check or misread output.
Reporting failure: final answer omitted risk or overstated success.

This taxonomy matters because each class has a different fix. A tool timeout does not need a better prompt. A missing approval gate does not need a smarter model.
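
The taxonomy above can be encoded so that every postmortem has to pick exactly one class, and each class points at a kind of fix. This is only a sketch: `FailureClass`, `FIRST_FIX`, and `first_fix` are hypothetical names, and the fix descriptions are illustrative assumptions drawn from the examples in this article, not a prescribed list.

```python
from enum import Enum


class FailureClass(Enum):
    MODEL_REASONING = "model reasoning"
    PROMPT_SPEC = "prompt/spec"
    TOOL = "tool"
    ENVIRONMENT = "environment"
    POLICY = "policy"
    VERIFICATION = "verification"
    REPORTING = "reporting"


# Illustrative mapping only; adapt the fixes to your own harness.
FIRST_FIX = {
    FailureClass.MODEL_REASONING: "add a regression eval for the missed constraint",
    FailureClass.PROMPT_SPEC: "state explicit success criteria in the task spec",
    FailureClass.TOOL: "surface tool errors and timeouts instead of hiding them",
    FailureClass.ENVIRONMENT: "record and pin dependencies, branch, and test data",
    FailureClass.POLICY: "add an approval gate for the action",
    FailureClass.VERIFICATION: "make the check a required step",
    FailureClass.REPORTING: "require exit codes and known risks in the final report",
}


def first_fix(failure: FailureClass) -> str:
    """Look up the first fix to consider for a given failure class."""
    return FIRST_FIX[failure]
```

Keeping the mapping in code makes it harder to answer a tool timeout with a prompt tweak, because the class has to be chosen before the fix is.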

Make the postmortem feed the harness

A postmortem is not done until at least one thing changes. Add a regression eval. Add a required check. Improve redaction. Split a dangerous tool. Make the final report include exit codes. Capture branch state. Whatever the fix is, it must land somewhere durable.
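
As one example of a durable fix, "make the final report include exit codes" can become a check that runs against every final report. The sketch below assumes, purely for illustration, that reports mention exit codes in the form "exit code: N" or "exit code N". The function names and that format are hypothetical and would need to match your own harness.

```python
import re
from typing import List

# Assumed report convention: "exit code: 0" or "exit code 1" (case-insensitive).
EXIT_CODE_PATTERN = re.compile(r"exit code[:\s]+(-?\d+)", re.IGNORECASE)


def report_exit_codes(report: str) -> List[int]:
    """Extract every exit code mentioned in the agent's final report."""
    return [int(code) for code in EXIT_CODE_PATTERN.findall(report)]


def check_report(report: str, commands_run: int) -> List[str]:
    """Return a list of problems; an empty list means the report passes."""
    problems = []
    codes = report_exit_codes(report)
    if len(codes) < commands_run:
        problems.append(
            f"report lists {len(codes)} exit codes for {commands_run} commands"
        )
    for code in codes:
        if code != 0:
            problems.append(f"non-zero exit code {code} reported")
    return problems
```

Wired into the harness as a required step, a check like this fails the run when the report omits exit codes, instead of leaving the omission for a reviewer to notice.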

The standard

If the next run can fail the same way without being caught earlier, the postmortem was documentation, not engineering.

Sources and further reading

Anthropic, Demystifying evals for AI agents
LangChain, Agent Evaluation Readiness Checklist
Dhiraj Das, Agent Blackbox Guide

About the Author

Dhiraj Das | Automation Consultant | 10+ years building automation systems that expose failures, reduce flakiness, and make complex workflows repeatable. He now applies that discipline independently to AI-agent validation, run replay, LLM testing, and postmortems.

Creator of many open-source tools solving what traditional automation can't: waitless (flaky tests), sb-stealth-wrapper (bot detection), selenium-teleport (state persistence), selenium-chatbot-test (AI chatbot testing), lumos-shadowdom (Shadow DOM), and visual-guard (visual regression).
