How to Debug AI Coding Agents When They Lie About Success

The most dangerous AI coding agent failure is not a red error. It is the confident green lie: “Fixed and verified,” when the build was never run, the wrong file was changed, or the test command failed three screens earlier.

This happens because the model is narrating its intention, not necessarily the environment. Humans do this too, but agents do it faster and with better formatting. Your defense is boring: inspect the workspace, rerun verification, and force claims to attach to evidence.

Start with the crime scene

Ignore the summary for the first five minutes. Check the repository.

git status --short
git diff --stat
git diff

Look for unrelated files, generated churn, deleted assets, lockfile changes, and edits outside the requested scope. AI coding agents often produce a useful patch surrounded by accidental movement. Your job is not to admire the summary. Your job is to separate the patch from the blast radius.
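
Separating the patch from the blast radius can be partly mechanical. The sketch below is one possible approach, not an established tool: it parses `git status --short` output and flags paths outside the scope you asked for. The helper names and the idea of expressing scope as glob patterns are assumptions made for illustration, and paths that git quotes because of unusual characters are not handled.

```python
import fnmatch
import subprocess


def changed_paths(status_output: str) -> list[str]:
    """Extract paths from `git status --short` output (two status chars, a space, the path)."""
    paths = []
    for line in status_output.splitlines():
        if len(line) < 4:
            continue
        path = line[3:]
        # Renames are printed as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths


def out_of_scope(paths: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return the paths that match none of the allowed glob patterns.

    fnmatch's "*" also matches "/", so "src/*" covers nested files too.
    """
    return [
        p for p in paths
        if not any(fnmatch.fnmatch(p, pattern) for pattern in allowed_patterns)
    ]


def git_status_short() -> str:
    """Run `git status --short` in the current working directory."""
    result = subprocess.run(
        ["git", "status", "--short"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

If the task was "fix the bug in src/", then `out_of_scope(changed_paths(git_status_short()), ["src/*"])` lists everything that still needs an explanation, such as a lockfile or a deleted asset.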

Re-run the verification yourself

If the agent said “tests pass,” run the exact command. If it did not specify the command, that is already a reliability defect. Good final reports should include the command, its exit code, and the relevant output. “Should pass” is not evidence.
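
A minimal sketch of what rerunning the command yourself can look like when scripted, assuming the claimed command can be passed as an argument list. `VerificationResult` and `run_verification` are hypothetical names, not part of any tool mentioned in this article.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class VerificationResult:
    command: list[str]
    exit_code: int
    output_tail: str


def run_verification(command: list[str], tail_lines: int = 20) -> VerificationResult:
    """Run the command and keep its exit code plus the last lines of its output."""
    proc = subprocess.run(command, capture_output=True, text=True)
    lines = (proc.stdout + proc.stderr).splitlines()
    return VerificationResult(command, proc.returncode, "\n".join(lines[-tail_lines:]))


def verified(result: VerificationResult) -> bool:
    """Only a zero exit code counts as a pass; the agent's prose does not."""
    return result.exit_code == 0
```

Pass in whatever command the agent claimed to have run. The result carries the three things a final report should have contained in the first place.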

The Upsun article on reliable coding agents frames the bottleneck well: model capability is not the main limiter; verification infrastructure is. Code is unusually verifiable compared with many AI tasks. That means we have no excuse to trust prose when a build, test, typecheck, or smoke run can answer.

Check for stale context

Agents often summarize stale output. They may run a test, edit code, and then report the old passing result. Or they may see a timeout and phrase it as a transient warning. Timeline matters.
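
The timeline check can be reduced to one question: did any edit happen after the last verification run? The sketch below assumes you already have a list of timestamped events from somewhere. The `Event` shape and the status labels are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    timestamp: float
    kind: str  # "edit" or "verify"
    exit_code: Optional[int] = None  # only meaningful for "verify" events


def verification_status(events: list[Event]) -> str:
    """Classify a run as "unverified", "stale", "failed", or "fresh-pass"."""
    verifies = [e for e in events if e.kind == "verify"]
    if not verifies:
        return "unverified"
    last_verify = max(verifies, key=lambda e: e.timestamp)
    last_edit = max(
        (e.timestamp for e in events if e.kind == "edit"), default=None
    )
    # A passing result that predates the final edit proves nothing about that edit.
    if last_edit is not None and last_edit > last_verify.timestamp:
        return "stale"
    return "fresh-pass" if last_verify.exit_code == 0 else "failed"
```

Anything other than "fresh-pass" means the success claim is not backed by the run's own history.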

A local flight recorder should capture each command with timestamp, working directory, exit code, stdout, and stderr. Without that, you are debugging from screenshots and vibes.
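
A flight recorder does not need to be elaborate. As a rough sketch, assuming commands are launched through a wrapper you control, the following appends one JSON line per command with the fields listed above. The `record_command` name and the JSON Lines log format are illustrative choices, not a description of any existing tool.

```python
import json
import os
import subprocess
import time


def record_command(command: list[str], log_path: str) -> dict:
    """Run a command and append one JSON line describing what happened."""
    started = time.time()
    proc = subprocess.run(command, capture_output=True, text=True)
    entry = {
        "timestamp": started,  # seconds since the epoch, taken before the run
        "cwd": os.getcwd(),
        "command": command,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
    # Append-only, so earlier entries are never rewritten.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry
```

With a log like this, "which command ran last, where, and how did it exit?" becomes a lookup rather than an argument.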

Common false-success patterns

The agent ran a narrow test but claimed the full suite passed.
The command failed, but the agent focused on earlier successful output.
The agent changed generated files instead of source files.
The patch fixed the visible error but broke a nearby behavior.
The agent skipped verification because dependencies were missing, then reported success anyway.
The agent edited from the wrong branch or wrong worktree.

Turn every lie into a harness rule

A false-success incident should not end with “be more careful.” Add a rule. Require final answers to include verification commands. Add a post-run checker that compares claimed files against actual diffs. Fail the run if no verification command executed. Record stdout/stderr. Make branch and worktree visible in the report.
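
Several of these rules fit in one post-run checker. The sketch below assumes the agent's final report has been parsed into a list of claimed files and claimed verification commands, and that executed commands were recorded as dictionaries with `command` and `exit_code` keys. All of these names and shapes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AgentReport:
    claimed_files: set[str]
    verification_commands: list[str] = field(default_factory=list)


def check_report(
    report: AgentReport, changed_files: set[str], executed: list[dict]
) -> list[str]:
    """Return a list of problems; an empty list means every claim has evidence."""
    problems = []
    unreported = changed_files - report.claimed_files
    phantom = report.claimed_files - changed_files
    if unreported:
        problems.append(f"changed but not reported: {sorted(unreported)}")
    if phantom:
        problems.append(f"reported but not changed: {sorted(phantom)}")
    if not report.verification_commands:
        problems.append("report names no verification command")
    for claimed in report.verification_commands:
        runs = [e for e in executed if e["command"] == claimed]
        if not runs:
            problems.append(f"claimed command never executed: {claimed}")
        elif runs[-1]["exit_code"] != 0:
            problems.append(
                f"claimed command last exited {runs[-1]['exit_code']}: {claimed}"
            )
    return problems
```

The exact-match comparison on commands is deliberate. It catches the case where the agent ran a narrow test but claimed the full suite passed.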

That is how Agent Blackbox should evolve: not just pretty timelines, but claim-to-evidence checking.

Debugging rule

Do not ask “did the agent sound correct?” Ask “which environment fact proves or disproves the claim?”

Sources and further reading

Upsun, Making coding agents reliable
Anthropic, Demystifying evals for AI agents
Dhiraj Das, From Passive Log-Reading to Active Stream-Tapping

About the Author

Dhiraj Das | Automation Consultant | 10+ years building automation systems that expose failures, reduce flakiness, and make complex workflows repeatable. He now applies that discipline independently to AI-agent validation, run replay, LLM testing, and postmortems.

Creator of many open-source tools solving what traditional automation can't: waitless (flaky tests), sb-stealth-wrapper (bot detection), selenium-teleport (state persistence), selenium-chatbot-test (AI chatbot testing), lumos-shadowdom (Shadow DOM), and visual-guard (visual regression).
