Codex and Hermes Agent for Automation QA Engineers: A Practical Field Guide

AI coding agents are becoming useful for much more than autocomplete. In quality engineering, they can inspect an application and its test suite, reproduce failures, edit automation code, run checks, compare evidence, and leave behind a reviewable change. Two particularly interesting tools are OpenAI Codex, which is deeply oriented around software repositories, and Hermes Agent from Nous Research, which is designed as a persistent, extensible autonomous agent.

This guide explains how to use both tools, where each one fits, and how automation quality assurance engineers can apply them without turning the test suite into unreviewed AI-generated code. The goal is not to remove engineering judgment. The goal is to spend less time on repetitive mechanics and more time deciding what should be tested, what evidence is trustworthy, and what risks matter.

Note: Agent products evolve quickly. Commands and capabilities in this article were checked against the official Codex and Hermes Agent documentation available on June 13, 2026.

The Short Version: Which Agent Should You Use?

Use Codex when the center of gravity is a code repository. It is well suited to understanding an existing framework, editing tests and application code, running local commands, reviewing diffs, and validating a change in the same working tree.
Use Hermes Agent when the center of gravity is an ongoing workflow. It is designed around persistent memory, reusable skills, scheduled automations, web and browser tools, messaging integrations, delegation, and operation across local or sandboxed environments.
Use them together when a workflow spans research, monitoring, communication, and repository changes. Hermes can collect and organize signals; Codex can make a tightly scoped, test-backed change in the repository.

What Is OpenAI Codex?

OpenAI Codex is a coding agent that can read, change, and run code in a selected project. It is available through the desktop app, command-line interface, IDE extension, and cloud workflows. The desktop app adds parallel threads, Git worktrees, review tools, automations, and an in-app browser.

For an automation engineer, the important distinction is that Codex is not merely producing a snippet in a chat window. It can inspect the framework conventions, find page objects and fixtures, identify the project's test commands, make edits, execute the relevant suite, and report exactly what changed.

Starting with Codex

In the Codex app, select the repository folder, keep the execution environment set to Local when you want it to work on your machine, and describe the outcome you need. In the CLI, install Codex using the current instructions in the official CLI guide, open a terminal in the repository, and run:

Code

cd path/to/your/automation-repo
codex

A useful first request is deliberately read-only:

Code

Explain this automation framework.

Identify:
1. The test runner and supported browsers.
2. Where fixtures, page objects, test data, and environment configuration live.
3. The commands for smoke, regression, API, and end-to-end tests.
4. How reports, screenshots, traces, and videos are produced.
5. The three highest maintenance risks you can infer.

Do not edit files yet. Cite the relevant file paths.

This gives you an immediate quality check on the agent's understanding. If it cannot accurately map the framework, it should not yet be trusted to refactor it.

Teach Codex Your QA Conventions with AGENTS.md

Codex reads AGENTS.md files before it starts work. This is one of the highest-leverage features for an automation team because it turns tribal knowledge into repository-level operating instructions. The file can specify framework commands, selector policy, test-data rules, evidence requirements, and review boundaries.

Code

# AGENTS.md

## Automation framework
- UI tests use Playwright with TypeScript.
- API tests use Playwright request contexts.
- Page objects live in tests/pages.
- Shared fixtures live in tests/fixtures.

## Quality rules
- Prefer role, label, placeholder, and test-id locators.
- Do not add fixed sleeps.
- Do not weaken assertions to make a test pass.
- Every bug fix must include a regression test that fails before the fix.
- Preserve trace, screenshot, and video collection on failure.
- Never place credentials or production customer data in fixtures.
- Keep generated test data deterministic by accepting a seed.

## Verification
- Run npm run lint after TypeScript changes.
- Run npm run test:smoke for shared fixture changes.
- Run the narrowest affected spec before the full suite.
- Report skipped tests, retries, and warnings explicitly.

## Review boundaries
- Ask before adding a dependency.
- Ask before changing CI secrets, production URLs, or destructive cleanup jobs.

Codex also supports layered instructions, as described in the official AGENTS.md guide. A root file can define organization-wide quality rules, while a nested file under an API, mobile, payments, or performance-testing folder can define specialized commands and constraints.

What Is Hermes Agent?

Hermes Agent is an open-source agent from Nous Research. Its design emphasizes persistent memory, reusable and auto-generated skills, scheduled automations, delegated subagents, messaging platforms, browser and web tools, and multiple execution backends such as local, Docker, SSH, Singularity, and Modal.

That makes Hermes interesting for quality workflows that continue beyond a single code-editing session: monitoring nightly runs, assembling release-readiness briefs, researching unfamiliar failures, coordinating evidence from multiple systems, and sending summaries through a team communication channel.

Starting with Hermes Agent

Follow the current platform instructions in the Hermes documentation. The official site currently presents a shell installer for macOS, Linux, and WSL2, followed by the setup wizard:

Code

curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

On Windows, use the installation path documented for Windows or WSL2 rather than assuming a Unix shell command will work natively. During setup, configure only the providers and integrations that your organization permits. Start with a local sandbox and read-only access to external systems.

A good first Hermes task is a bounded analysis with a concrete artifact:

Code

Inspect the latest Playwright HTML report in ./playwright-report
and the JUnit XML files in ./test-results.

Create qa-output/nightly-summary.md with:
- total, passed, failed, skipped, and flaky tests
- failures grouped by likely root cause
- the five most actionable failures
- links or file paths to supporting traces and screenshots
- suspected product defects separated from test-infrastructure defects

Do not modify tests or application code.
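
The counts an agent reports are easy to cross-check independently. Below is a minimal TypeScript sketch, assuming JUnit-style XML in which each `<testsuite>` element carries `tests`, `failures`, `errors`, and `skipped` attributes. `summarizeJUnit` is a hypothetical helper, reading the files from ./test-results is left out, and the regex-based parsing is a shortcut for illustration rather than a full XML parser. Flaky counts are not part of these common attributes, so they are omitted.

```typescript
// Totals derived from <testsuite> attributes in a JUnit-style report.
interface SuiteTotals {
  total: number;
  failed: number;
  skipped: number;
  passed: number;
}

// Read one numeric attribute from a tag's attribute string; missing means 0.
function readCount(attributes: string, name: string): number {
  const match = new RegExp(`\\b${name}="(\\d+)"`).exec(attributes);
  return match ? Number(match[1]) : 0;
}

// Hypothetical helper: sum the counts across every <testsuite> element.
// The \s after "testsuite" keeps the wrapping <testsuites> element out,
// so its aggregate attributes are not counted twice.
function summarizeJUnit(xml: string): SuiteTotals {
  const totals: SuiteTotals = { total: 0, failed: 0, skipped: 0, passed: 0 };
  const suiteTag = /<testsuite\s([^>]*)>/g;
  let match: RegExpExecArray | null;
  while ((match = suiteTag.exec(xml)) !== null) {
    const attrs = match[1] ?? "";
    totals.total += readCount(attrs, "tests");
    totals.failed += readCount(attrs, "failures") + readCount(attrs, "errors");
    totals.skipped += readCount(attrs, "skipped");
  }
  totals.passed = totals.total - totals.failed - totals.skipped;
  return totals;
}

// Made-up sample report used only to exercise the helper.
const sampleReport = `
<testsuites name="nightly" tests="7" failures="2">
  <testsuite name="checkout" tests="4" failures="1" errors="0" skipped="1"></testsuite>
  <testsuite name="search" tests="3" failures="0" errors="1" skipped="0"></testsuite>
</testsuites>`;

const summary = summarizeJUnit(sampleReport);
console.log(summary); // { total: 7, failed: 2, skipped: 1, passed: 4 }
```

If the agent's summary disagrees with a count like this, treat that as a reason to inspect the report before trusting the rest of the brief.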

Codex and Hermes: Different Centers of Gravity

Repository comprehension: Codex is the natural choice for tracing code, tests, fixtures, and Git history inside a project.
Implementation and repair: Codex is a strong fit for making a scoped patch, adding regression coverage, running commands, and reviewing the resulting diff.
Persistent operational context: Hermes is designed to remember useful information and build reusable skills over time.
Scheduled work: Both ecosystems support automation, but Hermes is especially natural for persistent agent workflows and cross-channel notifications, while Codex automations are valuable for recurring repository work and follow-ups.
Cross-system orchestration: Hermes is compelling when the task spans web research, dashboards, messaging, remote environments, or delegated subtasks.
Controlled software delivery: Codex's project, worktree, review, and Git-oriented experience makes it well suited to changes that must be inspected before they are committed.

The Prompt Pattern That Works for QA

Weak prompts ask an agent to "write tests." Strong prompts define the risk, evidence, constraints, and stopping condition. A practical structure is:

Context: Which system, feature, framework, and environment are involved?
Objective: What observable outcome should be achieved?
Evidence: Which requirements, traces, logs, screenshots, schemas, or existing tests should be treated as inputs?
Constraints: Which files may change? Which patterns are forbidden? What data is sensitive?
Verification: Which commands must pass, and what artifacts must be captured?
Stop conditions: When must the agent ask for approval rather than continue?

Code

Context:
This is a Playwright TypeScript suite for an e-commerce checkout.
The affected area is tests/checkout and tests/pages/CheckoutPage.ts.

Objective:
Add coverage for a declined card without duplicating existing login
and cart setup.

Evidence:
- Requirement: docs/payments/declined-card.md
- Existing happy path: tests/checkout/purchase.spec.ts
- API schema: contracts/payment-response.json

Constraints:
- Use the existing authenticatedUser fixture.
- Do not use waitForTimeout.
- Prefer accessible locators.
- Do not change production code unless the test proves a product defect.
- Use synthetic card data only.

Verification:
- Run the new test three times.
- Run npm run lint.
- Preserve trace-on-first-retry behavior.

Stop conditions:
- Ask for approval before adding a dependency or editing production code.

Output:
Summarize changed files, commands run, results, and residual risks.
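
Teams that reuse this structure may find it easier to maintain as a typed template than as free text. The sketch below is one possible shape in TypeScript. The `QaPrompt` interface and `renderPrompt` function are hypothetical names, not part of Codex or Hermes, and the sample values are abbreviated from the prompt above.

```typescript
// Hypothetical shape for the six-part prompt structure described above.
interface QaPrompt {
  context: string;
  objective: string;
  evidence: string[];
  constraints: string[];
  verification: string[];
  stopConditions: string[];
}

// Render a titled section whose items form a bullet list.
function section(title: string, items: string[]): string {
  return `${title}:\n${items.map((item) => `- ${item}`).join("\n")}`;
}

// Build the prompt text. Rendering fails if any list section is empty,
// because a missing section usually means a missing guardrail.
function renderPrompt(prompt: QaPrompt): string {
  const lists: Array<[string, string[]]> = [
    ["Evidence", prompt.evidence],
    ["Constraints", prompt.constraints],
    ["Verification", prompt.verification],
    ["Stop conditions", prompt.stopConditions],
  ];
  for (const [title, items] of lists) {
    if (items.length === 0) {
      throw new Error(`Prompt section "${title}" must not be empty`);
    }
  }
  return [
    `Context:\n${prompt.context}`,
    `Objective:\n${prompt.objective}`,
    ...lists.map(([title, items]) => section(title, items)),
  ].join("\n\n");
}

const declinedCardPrompt = renderPrompt({
  context: "Playwright TypeScript suite for an e-commerce checkout.",
  objective: "Add coverage for a declined card.",
  evidence: ["docs/payments/declined-card.md", "tests/checkout/purchase.spec.ts"],
  constraints: ["Do not use waitForTimeout.", "Use synthetic card data only."],
  verification: ["Run the new test three times.", "Run npm run lint."],
  stopConditions: ["Ask before adding a dependency."],
});
console.log(declinedCardPrompt);
```

Failing on an empty section is a deliberate choice here: it makes an omitted stop condition visible at authoring time instead of during an agent run.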

Use Case 1: Generate Tests from Requirements Without Losing Intent

An agent can translate acceptance criteria into a coverage model before writing code. This is safer than immediately asking for test cases because it exposes missing assumptions and redundant scenarios.

Code

Read docs/features/account-lockout.md and inspect the existing
authentication tests.

First produce a coverage table with:
- requirement
- risk
- test level: unit, API, UI, or security
- positive, negative, and boundary scenarios
- existing coverage
- proposed new coverage

Flag ambiguous requirements. Do not edit files until the coverage
table is complete. Then implement only the approved P0 scenarios.

Codex is especially useful for the second phase, implementation, because it can compare the proposal with the current suite and reuse local fixtures. Hermes can help when requirements are distributed across web pages, issue trackers, or team documents, provided those integrations are authorized.
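
A typed row can make the coverage table easier to review and diff. The following is a minimal sketch that assumes the columns from the prompt above. The `CoverageRow` type, the `priority` and `approved` fields, and the sample rows are hypothetical illustrations, not output from either tool or content from docs/features/account-lockout.md.

```typescript
type TestLevel = "unit" | "API" | "UI" | "security";

// One row of the coverage table requested in the prompt above.
interface CoverageRow {
  requirement: string;
  risk: string;
  level: TestLevel;
  scenarios: string[];
  existingCoverage: string[]; // spec paths that already cover the requirement
  proposedCoverage: string[];
  priority: "P0" | "P1" | "P2"; // assumed field; the prompt only names P0
  approved: boolean; // assumed field recording the human decision
}

// Only approved P0 rows with proposed new coverage move on to implementation.
function readyToImplement(rows: CoverageRow[]): CoverageRow[] {
  return rows.filter(
    (row) => row.approved && row.priority === "P0" && row.proposedCoverage.length > 0,
  );
}

// Render rows as a Markdown table for human review.
function toMarkdown(rows: CoverageRow[]): string {
  const header = "| Requirement | Risk | Level | Existing | Proposed |\n|---|---|---|---|---|";
  const body = rows.map(
    (row) =>
      `| ${row.requirement} | ${row.risk} | ${row.level} | ` +
      `${row.existingCoverage.join(", ") || "none"} | ${row.proposedCoverage.join(", ") || "none"} |`,
  );
  return [header, ...body].join("\n");
}

// Hypothetical rows for illustration only.
const coverageRows: CoverageRow[] = [
  {
    requirement: "Account locks after repeated failed logins",
    risk: "Brute-force access",
    level: "API",
    scenarios: ["below threshold", "at threshold", "above threshold"],
    existingCoverage: [],
    proposedCoverage: ["tests/api/account-lockout.spec.ts"],
    priority: "P0",
    approved: true,
  },
  {
    requirement: "Locked user sees a recovery message",
    risk: "User cannot recover access",
    level: "UI",
    scenarios: ["locked account login attempt"],
    existingCoverage: ["tests/auth/login.spec.ts"],
    proposedCoverage: [],
    priority: "P1",
    approved: false,
  },
];

console.log(toMarkdown(coverageRows));
console.log(readyToImplement(coverageRows).map((row) => row.requirement));
```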

Use Case 2: Diagnose Flaky Tests Scientifically

Flaky-test repair is an ideal agent task only when the agent is prohibited from hiding the symptom. Fixed sleeps, broad retries, and weaker assertions often make a dashboard greener while making the suite less trustworthy.

Code

Investigate tests/search/live-results.spec.ts, which failed 7 of
50 CI runs.

Inspect the test, fixtures, application code, trace files, network
events, and failure screenshots.

Classify the leading hypothesis as one of:
- synchronization
- shared state
- selector instability
- test-data collision
- environment dependency
- product race condition
- resource exhaustion

Reproduce the failure before editing. Do not add sleeps or increase
retries. Implement the smallest fix and run the test 20 times.
Report the reproduction rate before and after the change.

The crucial instruction is "reproduce before editing." Without it, an agent may optimize for a passing run rather than causal understanding.
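
It is worth checking how much evidence 20 clean runs actually provide. If the observed failure rate is 7 of 50 (14%) and runs are assumed to be independent, an unfixed test still passes 20 times in a row with probability 0.86^20, a little under 5%. The sketch below does that arithmetic. The function names are hypothetical, and the independence assumption is a simplification that shared CI infrastructure can easily violate.

```typescript
// Failure rate observed in a batch of runs.
function failureRate(failures: number, runs: number): number {
  if (runs <= 0) throw new Error("runs must be positive");
  return failures / runs;
}

// Probability that a test with the given failure rate passes `runs` times
// in a row purely by chance, assuming independent runs (a simplification).
function chanceOfCleanStreak(rate: number, runs: number): number {
  return Math.pow(1 - rate, runs);
}

// Smallest number of consecutive passes for which a clean streak would
// happen by chance less often than `alpha`.
function runsNeeded(rate: number, alpha: number): number {
  if (rate <= 0 || rate >= 1) throw new Error("rate must be between 0 and 1");
  return Math.ceil(Math.log(alpha) / Math.log(1 - rate));
}

const rateBeforeFix = failureRate(7, 50); // 0.14
const luckyStreak = chanceOfCleanStreak(rateBeforeFix, 20); // about 0.049
const runsForOnePercent = runsNeeded(rateBeforeFix, 0.01); // 31

console.log(rateBeforeFix, luckyStreak.toFixed(3), runsForOnePercent);
```

Under these assumptions, 20 passes is reasonable evidence for a test that failed 14% of the time. A test that fails far less often needs many more runs before a clean streak means much.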

Use Case 3: Build Better Page Objects and Fixtures

Codex can identify duplicated selectors, repeated setup flows, overgrown page objects, and fixtures that silently share mutable state. Ask it to quantify duplication and preserve behavior before refactoring.

Code

Review tests/pages and tests/fixtures for maintainability risks.

Find:
- selectors duplicated in three or more files
- page objects that mix assertions with navigation
- fixtures that mutate shared accounts
- helpers that swallow errors
- hard-coded environment data

Propose a small refactor ranked by risk reduction. Implement only the
highest-value item. Keep public helper signatures stable and run all
directly affected specs.

For selector work, require the agent to prefer user-facing semantics such as role, label, and stable test IDs. A technically valid CSS or XPath selector is not automatically a maintainable one.
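
The "duplicated in three or more files" check can be approximated mechanically, which gives you a baseline to compare with the agent's findings. The sketch below assumes selector strings have already been extracted per file, for example by a search over locator calls. `findDuplicatedSelectors` and the sample paths and selectors are hypothetical.

```typescript
// Map of file path to the selector strings found in that file.
type SelectorsByFile = Record<string, string[]>;

interface DuplicatedSelector {
  selector: string;
  files: string[];
}

// Report selectors that appear in at least `minFiles` distinct files,
// most widely duplicated first.
function findDuplicatedSelectors(
  selectorsByFile: SelectorsByFile,
  minFiles = 3,
): DuplicatedSelector[] {
  const filesBySelector = new Map<string, Set<string>>();
  for (const [file, selectors] of Object.entries(selectorsByFile)) {
    for (const selector of selectors) {
      const files = filesBySelector.get(selector) ?? new Set<string>();
      files.add(file); // a Set counts each file once, even with repeats
      filesBySelector.set(selector, files);
    }
  }
  return Array.from(filesBySelector.entries())
    .filter(([, files]) => files.size >= minFiles)
    .map(([selector, files]) => ({ selector, files: Array.from(files).sort() }))
    .sort((a, b) => b.files.length - a.files.length);
}

// Hypothetical extraction result for illustration only.
const duplicateReport = findDuplicatedSelectors({
  "tests/pages/CheckoutPage.ts": ["#submit-order", "[data-testid=cart-total]"],
  "tests/checkout/purchase.spec.ts": ["#submit-order"],
  "tests/checkout/refund.spec.ts": ["#submit-order", "#submit-order"],
  "tests/pages/CartPage.ts": ["[data-testid=cart-total]"],
});

console.log(duplicateReport);
```

In this sample only `#submit-order` is reported, because it appears in three files while the test-id selector appears in two.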

Use Case 4: API Contract and Data Validation

Agents can compare OpenAPI documents, captured responses, client models, and automated checks. They are particularly useful for finding places where a UI test is redundantly verifying behavior that belongs in a faster API or contract test.

Code

Compare contracts/openapi.yaml with the API tests under tests/api.

For POST /orders:
- identify required and optional fields
- generate positive, negative, boundary, and authorization scenarios
- verify response status, headers, and schema
- test idempotency behavior
- avoid asserting volatile IDs or timestamps exactly
- reuse the project's schema-validation helper

Create a deterministic data builder with an optional seed.
Run only the new API spec and the linter.
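
A deterministic builder can be as small as a seeded pseudo-random generator plus explicit overrides. The sketch below uses the mulberry32 generator as a stand-in for whatever the project already has. The `OrderRequest` shape, its field names, and the default seed are hypothetical and are not taken from any real contracts/openapi.yaml.

```typescript
// mulberry32: a small seedable generator. It is fine for repeatable test
// data and unsuitable for anything security-sensitive.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Hypothetical request body for POST /orders.
interface OrderRequest {
  customerId: string;
  sku: string;
  quantity: number;
  idempotencyKey: string;
}

// Build an order from a seed; explicit overrides always win.
function buildOrder(overrides: Partial<OrderRequest> = {}, seed = 1): OrderRequest {
  const random = mulberry32(seed);
  const integer = (min: number, max: number): number =>
    min + Math.floor(random() * (max - min + 1));
  return {
    customerId: `cust-${integer(1000, 9999)}`,
    sku: `sku-${integer(1, 500)}`,
    quantity: integer(1, 5),
    idempotencyKey: `key-${seed}-${integer(100000, 999999)}`,
    ...overrides,
  };
}

const seededOrder = buildOrder({}, 42);
const repeatedOrder = buildOrder({}, 42);
console.log(JSON.stringify(seededOrder) === JSON.stringify(repeatedOrder)); // true

// Boundary scenarios override only the field under test.
const zeroQuantityOrder = buildOrder({ quantity: 0 }, 42);
console.log(zeroQuantityOrder.quantity); // 0
```

Logging the seed alongside a failing test lets anyone rebuild the exact payload later, which is the main reason to accept a seed at all.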

Use Case 5: Turn CI Failures into an Actionable Triage Queue

This is where Hermes can act as the operational layer. A scheduled workflow can collect test reports, cluster failures, compare them with recent history, and publish a concise morning brief. It should not automatically rewrite tests in response to every failure.

Code

Every weekday after the nightly regression finishes:

1. Read the JUnit, Playwright, and accessibility reports.
2. Compare failures with the previous five runs.
3. Group failures by normalized stack trace and affected feature.
4. Mark new, recurring, recovered, and infrastructure-only failures.
5. Create a Markdown release-quality brief.
6. Notify the QA channel with counts and a link to the brief.

Never include secrets, tokens, customer data, or raw environment
variables. Never change code or quarantine tests automatically.
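
Steps 2 to 4 are mechanical enough to sketch. The example below normalizes error text so that volatile details do not split a cluster, then labels each cluster against earlier runs. The normalization rules, type names, and sample messages are assumptions for illustration. Real reports need rules tuned to their own noise, and the infrastructure-only label is left out because it depends on project-specific signals.

```typescript
type Status = "new" | "recurring" | "recovered";

interface Failure {
  test: string;
  error: string;
}

// Replace volatile fragments so equivalent failures share one key.
function normalizeError(error: string): string {
  return error
    .replace(/0x[0-9a-f]+/gi, "<hex>")
    .replace(/\b\d+(\.\d+)?ms\b/g, "<duration>")
    .replace(/:\d+:\d+/g, ":<line>")
    .replace(/\d+/g, "<n>")
    .trim();
}

// Group the current run's failures by normalized error.
function clusterFailures(failures: Failure[]): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const failure of failures) {
    const key = normalizeError(failure.error);
    const tests = clusters.get(key) ?? [];
    tests.push(failure.test);
    clusters.set(key, tests);
  }
  return clusters;
}

// Label cluster keys. `previous` is ordered oldest to newest.
function classify(current: Set<string>, previous: Array<Set<string>>): Map<string, Status> {
  const labels = new Map<string, Status>();
  const seenBefore = new Set<string>();
  for (const run of previous) run.forEach((key) => seenBefore.add(key));
  current.forEach((key) => labels.set(key, seenBefore.has(key) ? "recurring" : "new"));
  // Recovered: failing in the most recent earlier run, absent now.
  const latest = previous[previous.length - 1];
  if (latest) {
    latest.forEach((key) => {
      if (!current.has(key)) labels.set(key, "recovered");
    });
  }
  return labels;
}

// Made-up failures for illustration only.
const tonight: Failure[] = [
  { test: "checkout confirms order", error: "Timeout 5000ms exceeded at checkout.spec.ts:42:7" },
  { test: "checkout shows receipt", error: "Timeout 7000ms exceeded at checkout.spec.ts:88:3" },
  { test: "search returns results", error: "Expected 3 results but received 0" },
];
const earlierRuns = [
  new Set(["Expected <n> results but received <n>"]),
  new Set(["Expected <n> results but received <n>", "Login failed with status <n>"]),
];

const clusters = clusterFailures(tonight);
const labels = classify(new Set(clusters.keys()), earlierRuns);
console.log(clusters, labels);
```

Here the two timeout failures collapse into one new cluster, the search failure is recurring, and the login failure from the previous run is marked recovered.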

A human can then hand a specific cluster to Codex: reproduce it in the repository, inspect the relevant code, implement a minimal fix, and run focused verification. This division of labor keeps monitoring separate from code modification.

Use Case 6: Exploratory Testing and Browser Evidence

Both tools can participate in browser-oriented work when the relevant browser capability is configured. The agent should be given a charter rather than vague permission to click around.

Code

Exploratory charter: subscription upgrade flow.

Personas:
- trial user
- active monthly subscriber
- expired subscriber

Focus risks:
- incorrect pricing
- loss of entered data
- duplicate submission
- inaccessible validation
- broken back navigation
- inconsistent state after refresh

Use only the staging environment and synthetic accounts.
Capture steps, expected and actual results, screenshots, console errors,
and relevant network failures. Do not complete a real purchase.
Produce a session report, not automated tests.

Once a defect is confirmed, Codex can convert the evidence into a regression test. Keeping exploration and automation as separate phases prevents premature encoding of misunderstood behavior.

A Combined Codex and Hermes Workflow

Observe: Hermes collects scheduled reports, browser observations, release notes, and recurring failure signals.
Prioritize: Hermes creates a risk-ranked brief with links to evidence instead of dumping raw logs into chat.
Investigate: A QA engineer selects a concrete issue and asks Codex to reproduce it inside the repository.
Repair: Codex changes the smallest appropriate set of files and adds or updates regression coverage.
Verify: Codex runs focused tests, linting, and any required broader suite, then reports skipped checks.
Review: The engineer inspects the diff, test evidence, and assumptions before committing.
Learn: Stable lessons become an AGENTS.md rule, a Codex skill, or a Hermes skill rather than remaining hidden in conversation history.

Example: From Nightly Failure to Verified Fix

Code

Hermes task:
Analyze the latest 10 nightly reports. The checkout-confirmation
cluster is the only new P0 regression. Produce a concise evidence
packet containing failing test names, first-seen build, normalized
error, trace paths, screenshots, and likely ownership.

Codex task:
Use qa-output/checkout-confirmation-evidence.md as the starting point.
Reproduce the failure locally, inspect the checkout implementation and
tests, and determine whether this is a product defect or test defect.
Add a regression test that demonstrates the issue. Implement the
smallest justified fix, run the affected suite, and show the diff.

Guardrails Every QA Team Should Add

Keep production out of bounds by default. Use staging systems and synthetic accounts unless a separately approved production validation procedure exists.
Protect secrets and personal data. Do not paste tokens, customer records, session cookies, or unredacted logs into prompts or long-term memory.
Use least privilege. Start with read-only integrations and narrow filesystem or repository access.
Require approval for destructive actions. Database cleanup, account deletion, branch force-pushes, secret changes, and test-environment resets need explicit human confirmation.
Never auto-quarantine failures. Quarantine can hide real regressions. Require a documented owner, reason, expiry, and tracking issue.
Do not reward green at any cost. Ban assertion weakening, catch-all exception handlers that swallow errors, arbitrary sleeps, and retry inflation unless technically justified.
Demand evidence. Ask for commands run, exit results, changed files, reproduction rates, artifacts, assumptions, and checks that could not be completed.
Review generated code exactly like human code. Agent output still needs maintainability, security, accessibility, and test-value review.
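
Some of these bans can be checked mechanically before a human review begins. The sketch below scans the added lines of a unified diff for a few patterns. The rule list is a hypothetical starting point, the regular expressions will produce both false positives and false negatives, and `findBannedPatterns` is not a feature of either agent.

```typescript
interface Finding {
  rule: string;
  line: string;
}

// Hypothetical starter rules; extend or relax them per project.
const bannedPatterns: Array<{ rule: string; pattern: RegExp }> = [
  { rule: "fixed sleep", pattern: /\bwaitForTimeout\s*\(/ },
  { rule: "focused test", pattern: /\b(test|it|describe)\.only\s*\(/ },
  { rule: "skipped test", pattern: /\b(test|it|describe)\.(skip|fixme)\s*\(/ },
  { rule: "empty catch", pattern: /catch\s*(\([^)]*\))?\s*\{\s*\}/ },
];

// Inspect only lines the change adds ("+" prefix, excluding "+++" headers).
function findBannedPatterns(diff: string): Finding[] {
  const findings: Finding[] = [];
  for (const raw of diff.split("\n")) {
    if (!raw.startsWith("+") || raw.startsWith("+++")) continue;
    const line = raw.slice(1).trim();
    for (const { rule, pattern } of bannedPatterns) {
      if (pattern.test(line)) findings.push({ rule, line });
    }
  }
  return findings;
}

// Made-up diff for illustration only.
const sampleDiff = [
  "+++ b/tests/checkout/purchase.spec.ts",
  "+  await page.waitForTimeout(3000);",
  "-  await expect(total).toHaveText('$42.00');",
  "+  try { await submit(); } catch (e) {}",
  "+  await expect(page.getByRole('alert')).toBeVisible();",
].join("\n");

const findings = findBannedPatterns(sampleDiff);
console.log(findings.map((finding) => finding.rule)); // [ 'fixed sleep', 'empty catch' ]
```

Note what the scan misses: the removed assertion in the sample diff is a weakening that only a reviewer reading the deleted lines would catch, so a check like this supplements review rather than replacing it.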

Common Failure Modes

The agent invents project conventions. Fix this by asking it to cite files and by maintaining AGENTS.md.
The generated test mirrors implementation instead of behavior. Anchor tests to requirements and user-observable outcomes.
The agent over-mocks. Specify which boundaries may be mocked and retain at least one realistic integration path.
A passing test is treated as proof. Require repeated runs for flaky scenarios and negative controls for bug fixes: the regression test should fail when the fix is reverted.
Large refactors arrive with weak verification. Limit file scope, preserve public interfaces, and split the work into reviewable stages.
Persistent memory stores stale or sensitive information. Define retention rules and periodically audit agent memory and installed skills.
Browser automation performs irreversible actions. Use sandbox accounts, spending limits, allowlists, and explicit stop conditions.
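
Both the secrets guardrail and the persistent-memory failure mode come down to redacting text before it reaches a prompt, a brief, or long-term memory. The sketch below shows the idea with three hypothetical rules. It is an illustration only: real projects need patterns for their own credential formats, and pattern-based redaction should be treated as a safety net rather than a guarantee.

```typescript
// Hypothetical redaction rules, applied in order.
const redactions: Array<{ pattern: RegExp; replacement: string }> = [
  // Bearer tokens in headers or logs.
  { pattern: /\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, replacement: "Bearer <redacted>" },
  // key=value or key: value pairs with sensitive-looking names.
  {
    pattern: /\b(password|secret|token|api[_-]?key)(\s*[=:]\s*)("[^"]*"|\S+)/gi,
    replacement: "$1$2<redacted>",
  },
  // Email addresses, as a simple proxy for personal data.
  { pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, replacement: "<email>" },
];

// Apply every rule to the text before it is shared or stored.
function redact(text: string): string {
  return redactions.reduce(
    (result, { pattern, replacement }) => result.replace(pattern, replacement),
    text,
  );
}

// Made-up log line for illustration only.
const rawLogLine = "Authorization: Bearer abc.def-123 password=hunter2 user=qa@example.com";
const safeLogLine = redact(rawLogLine);
console.log(safeLogLine);
// Authorization: Bearer <redacted> password=<redacted> user=<email>
```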

A Practical 30-Day Adoption Plan

Week 1 - Read-only assistance: repository mapping, test inventory, report summarization, requirement-to-coverage tables, and maintenance-risk reviews.
Week 2 - Narrow code changes: missing assertions, deterministic data builders, one flaky-test investigation, and one regression test for a known defect.
Week 3 - Reusable guidance: add AGENTS.md, formalize prompt templates, create a small QA skill, and define approval boundaries.
Week 4 - Controlled operations: schedule a nightly quality brief with Hermes or a Codex automation, keep integrations read-only, measure usefulness, and review false conclusions.

Track outcomes rather than prompt volume: time to triage, flaky-test recurrence, escaped defects, review rework, test runtime, duplicate coverage removed, and percentage of agent changes accepted without major revision.

Final Takeaway

Codex and Hermes Agent are not interchangeable, and that is useful. Codex is strongest as a repository-aware engineering partner. Hermes is strongest as a persistent, extensible operational agent. For automation QA engineers, the best results come from giving each tool a bounded role and requiring evidence at every handoff.

Start with explanation and diagnosis. Add tightly scoped implementation only after the agent demonstrates that it understands the framework. Encode stable team rules in AGENTS.md or reusable skills. Keep destructive actions and production access behind human approval. Used this way, agents can improve the speed of quality engineering without lowering the standard of proof.

About the Author

Dhiraj Das | Automation Consultant | 10+ years building automation systems that expose failures, reduce flakiness, and make complex workflows repeatable. He now applies that discipline independently to AI-agent validation, run replay, LLM testing, and postmortems.

He shares small open source utilities from real automation work, including: waitless (flaky tests), sb-stealth-wrapper (bot detection), selenium-teleport (state persistence), selenium-chatbot-test (AI chatbot testing), lumos-shadowdom (Shadow DOM), and visual-guard (visual regression).
