Latest Insights

pytest-why: Turning Pytest Failures into Actionable Engineering Guidance

Read Article

Build a Private, Conversion-Focused AI Front Desk for WhatsApp

Read Article
pytest-why

pytest-why

PythonPyPIpytestDiagnosticsCI/CDTest Automation

The Challenge

Pytest provides technically complete tracebacks, but developers still spend time deciding whether a failure came from assertions, fixtures, imports, timeouts, collection, browser timing, or teardown before they can act.

The Solution

Built an opt-in pytest plugin that listens to runtime and collection reports, classifies supported failure patterns, enriches Selenium and Playwright failures with browser context, and produces terminal, Markdown, and escaped standalone HTML guidance without hiding the original traceback.

  • ✓Opt-in pytest --why workflow
  • ✓Setup, call, teardown, and collection failure capture
  • ✓Assertion, fixture, import, timeout, and unknown classifications
  • ✓Selenium and Playwright context-aware hints
  • ✓Shareable Markdown and standalone HTML reports
  • ✓Complete raw traceback preservation
  • ✓Safe Markdown fences and escaped HTML output
  • ✓Pytester-based end-to-end plugin verification

pytest-why: Turning Pytest Failures into Actionable Engineering Guidance

Pytest is excellent at telling us that a test failed. It gives us the failing node, the test phase, the assertion diff, and the traceback that led to the error.

The harder question is what to do next.

In a large test suite, the raw output can be technically complete and still require several minutes of interpretation. Was the failure caused by an incorrect assertion, a fixture that never initialized, a broken import, a timeout, or a browser interaction that happened before the page was ready?

pytest-why adds a small diagnostic layer to that workflow. It observes failed pytest reports, classifies common failure patterns, prints a concise explanation at the end of the run, and creates Markdown and HTML reports that can be shared with the rest of the team.

The package does not replace pytest's traceback. It organizes the traceback around three practical questions:

  1. What kind of failure is this?
  2. Why does this category of failure usually happen?
  3. What should the developer inspect first?

This article explains what pytest-why does, how it integrates with pytest, how the classifier works, how reports are generated, and where the current design draws its boundaries.

The Problem: A Traceback Is Evidence, Not a Diagnosis

Consider a simple failing test:

def test_total_price():
    subtotal = 100
    tax = 18

    assert subtotal + tax == 120

Pytest correctly reports:

E       assert (100 + 18) == 120

For an experienced developer, the next step is obvious: compare the expected and actual values and trace where they diverged. For a new contributor, a large CI log, or a failure buried among dozens of other failures, that interpretation still takes time.

Other failures are less direct:

  • A missing fixture fails during setup, before the test body runs.
  • An import error may stop collection entirely.
  • A timeout often points to a blocked operation rather than the final line in the traceback.
  • A Selenium or Playwright failure may be caused by selector drift, page timing, an incorrect wait, or an element that is present but not interactable.
  • A teardown error can make a successful test appear alongside a broken cleanup process.

The common thread is that pytest provides the evidence, but developers still have to classify the failure before debugging it.

pytest-why makes that classification explicit.

What pytest-why Does

The package is a pytest plugin activated through one command-line flag:

pytest --why

When enabled, it:

  • Captures failures from test setup, test execution, and teardown.
  • Captures errors that happen during test collection.
  • Examines pytest's textual traceback representation.
  • Assigns a known failure category when a supported pattern matches.
  • Prints a short explanation and debugging hint in the terminal summary.
  • Writes pytest-why-report.md.
  • Writes pytest-why-report.html.
  • Preserves the complete raw traceback in both reports.

Without --why, the runtime plugin is not registered and no report files are created.

That opt-in behavior matters. A pytest plugin should not silently change every test run merely because it is installed. Teams can add --why locally while debugging, enable it in a CI troubleshooting job, or use it only when they need shareable diagnostics.

Installation

Install the package from PyPI:

python -m pip install pytest-why

Then run an existing test suite with:

python -m pytest --why

No test code changes are required.

A First Example

Given this test:

def test_math():
    assert 2 + 2 == 5

Running:

pytest --why

adds a summary like this:

================ pytest-why: failure explanations ================
Total failures: 1
Assertion mismatch: test_math.py::test_math (call)
  Why: The code ran, but the observed value or state did not match what the test expected.
  Hint: Compare the expected and actual values near the final assertion, then trace where they first diverge.
Reports: pytest-why-report.md, pytest-why-report.html

The original pytest failure remains visible. The plugin adds a compact interpretation after the normal output instead of hiding or rewriting pytest's diagnostics.

Understanding Pytest's Failure Phases

One of the most useful pieces of context in a pytest failure is the phase in which it occurred.

A test item can produce reports for:

  • setup: fixtures and other pre-test preparation.
  • call: execution of the test function.
  • teardown: fixture finalizers and cleanup.

Collection errors happen even earlier, while pytest imports test modules and discovers tests.

pytest-why records the phase with every failure. That is especially important for fixture classification. The word "fixture" appearing in an assertion message does not necessarily mean pytest failed to resolve a fixture. The classifier treats missing fixtures, scope mismatches, and recursive fixture dependencies as fixture errors only when the report phase is setup.

For example:

def test_requires_database(database):
    assert database.is_connected()

If database is not defined, pytest fails during setup. pytest-why reports:

Fixture error: test_database.py::test_requires_database (setup)
  Why: Pytest could not prepare the test because a fixture is missing,
       has an incompatible scope, or depends on itself.
  Hint: Check the fixture name, where it is defined, its scope, and its
        dependency chain.

The phase tells the developer not to debug the assertion or application logic: the test body never ran.

The Classification Model

The current classifier recognizes five outcomes.

ClassificationTypical signalsFirst debugging direction
Assertion mismatchAssertionError, pytest assertion output, collection diffsCompare expected and actual values
Import errorImportError, ModuleNotFoundError, missing module or symbol textCheck installation, import paths, symbols, and circular imports
Fixture errorMissing fixture, ScopeMismatch, recursive fixture dependency during setupCheck fixture discovery, scope, and dependency chains
TimeoutTimeout exceptions, timeout plugin output, "timed out" textFind the blocked operation and inspect wait boundaries
Unknown failureNo supported pattern matchedStart with the final application frame and inspect nearby state

Classification order is deliberate.

A fixture error is checked first because setup failures often contain generic language that could overlap with other patterns. Import and timeout failures are then checked before assertion mismatches. Anything that does not match a known category receives an unknown_failure result rather than being forced into an incorrect explanation.

The unknown category is a necessary design choice. Diagnostic software should be willing to say that it does not recognize a failure. A broad but inaccurate classification would send the developer in the wrong direction.

Browser Automation Context

Browser failures have a recognizable debugging surface. A missing element in a Selenium or Playwright test can involve:

  • An outdated selector.
  • A page that has not finished loading.
  • A missing explicit wait.
  • An element outside the visible or interactable state.
  • A stale reference after a DOM update.

When the traceback contains browser-related terms such as selenium, webdriver, NoSuchElementException, playwright, locator, or page., pytest-why appends a browser-specific hint to the base classification.

For example:

def test_login_heading(page):
    page.goto("http://localhost:8000/login")
    heading = page.locator("h1").text_content()

    assert heading == "Dashboard"

If the page renders Welcome back, the result is still an assertion mismatch, but the hint also asks the developer to verify selectors, waits, page timing, and element visibility.

This is a useful distinction: browser context enriches the classification without creating a separate category for every automation framework exception.

How the Pytest Plugin Works

The package is registered through pytest's pytest11 entry-point group:

[project.entry-points.pytest11]
why = "pytest_why.plugin"

That allows pytest to discover the package after installation.

The plugin module first defines the command-line option:

def pytest_addoption(parser):
    group = parser.getgroup("pytest-why")
    group.addoption(
        "--why",
        action="store_true",
        default=False,
        help="Explain failures and write pytest-why Markdown and HTML reports.",
    )

During configuration, the runtime collector is registered only when the flag is present:

def pytest_configure(config):
    if config.getoption("--why"):
        config.pluginmanager.register(WhyPlugin(), "pytest-why-runtime")

The runtime plugin listens to two report streams:

def pytest_runtest_logreport(self, report):
    self._record_failure(report, report.when)

def pytest_collectreport(self, report):
    self._record_failure(report, "collect")

pytest_runtest_logreport covers setup, call, and teardown reports. pytest_collectreport covers failures raised while collecting tests.

For each failed report, the plugin stores:

nodeid
phase
duration
longreprtext
type
title
explanation
hint

At the end of the session, pytest_terminal_summary prints the concise terminal view and sends the complete failure list to both report writers.

The flow is intentionally small:

pytest report
    -> failure collector
    -> classifier
    -> normalized failure record
    -> terminal + Markdown + HTML

Keeping classification separate from report rendering makes the behavior easier to test and allows future output formats to consume the same normalized data.

Why the Raw Traceback Is Preserved

An explanation is useful, but it is not a substitute for evidence.

Each Markdown and HTML report includes:

  • The test node ID.
  • The pytest phase.
  • The normalized classification.
  • The duration, when pytest provides one.
  • The explanation.
  • The suggested next step.
  • The complete raw traceback.

This gives the report two reading levels. A developer can scan the title and hint to triage several failures quickly, then expand the traceback for detailed investigation.

It also avoids a common problem in diagnostic tools: summarizing so aggressively that the original context disappears.

Markdown Reports for Engineering Workflows

The Markdown report is designed for systems where plain text is already the native format:

  • Pull-request descriptions.
  • GitHub issues.
  • Incident notes.
  • CI artifacts.
  • Team chat threads.
  • Internal documentation.

A report entry looks like:

## 1. `tests/test_checkout.py::test_total`

- **Phase:** `call`
- **Type:** `assertion_mismatch` - Assertion mismatch
- **Duration:** 0.012s

**Why:** The code ran, but the observed value or state did not match what the
test expected.

**Hint:** Compare the expected and actual values near the final assertion, then
trace where they first diverge.

The raw traceback is placed inside a collapsible <details> block.

There is a subtle implementation detail here: tracebacks can contain Markdown backticks. A fixed triple-backtick fence could be terminated by traceback content and corrupt the report. The reporter scans for the longest run of backticks and chooses a fence that is at least one character longer.

If the traceback contains:

```text
example

the outer report uses four backticks. This keeps arbitrary traceback text inside
the intended code block.

## Standalone HTML Reports

The HTML report provides the same data in a styled, portable document.

Each failure is rendered as a card with:

- Responsive metadata.
- A readable explanation and hint.
- A collapsible traceback.
- Light and dark color-scheme support.
- Wrapped node IDs and traceback content.

The report has no external stylesheet or JavaScript dependency, so one file
contains the entire result.

Raw failure data must be treated as untrusted content. Assertion messages and
tracebacks can include strings from web pages, APIs, fixtures, or user input.
The HTML writer escapes every dynamic field before inserting it into the
document:

```python
traceback=escape(str(failure.get("longreprtext", "")))

Without escaping, a traceback containing <script> or other HTML could change the report document. The test suite verifies that such content is displayed as text rather than interpreted as markup.

Collection Errors Are First-Class Failures

Many test-reporting tools focus only on executed test functions. That misses a major class of pytest failures: tests that could not be collected.

Consider:

import package_that_does_not_exist


def test_unreachable():
    pass

The test function never runs. Pytest raises a collection error while importing the module.

Because pytest-why listens to collection reports, it can still produce:

Import error: test_import_error.py (collect)
  Why: Python could not import a module or symbol required while collecting
       or running this test.
  Hint: Verify the package is installed, the import path is correct, and the
        symbol exists without a circular import.

This is one of the most important integration details in the package. Capturing only runtime reports would make the import-error classification incomplete.

Report Behavior in CI

A useful CI pattern is to preserve the generated files as artifacts:

- name: Run tests with explanations
  run: python -m pytest --why

- name: Upload pytest-why reports
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: pytest-why-reports
    path: |
      pytest-why-report.md
      pytest-why-report.html

The if: always() condition matters because pytest exits with a non-zero status when tests fail. The upload step should still run so the diagnostic report is available for the failed job.

The Markdown file can also be posted into a pull request by a separate workflow, while the HTML file remains a downloadable artifact for deeper inspection.

Testing the Plugin Itself

Pytest plugins need more than unit tests around helper functions. They should be tested through real pytest runs.

The pytest-why test suite uses pytest's pytester fixture to create temporary test modules and execute nested pytest sessions. These integration tests verify that:

  • --why prints the expected terminal summary.
  • The Markdown and HTML files are created.
  • A normal pytest run without --why remains unchanged.
  • Missing fixtures are classified during setup.
  • Import errors are captured during collection.

Separate classifier tests cover each supported category and browser-hint enrichment.

Reporter tests verify:

  • Duration formatting.
  • Markdown structure.
  • Safe handling of embedded backtick fences.
  • HTML escaping of traceback content.

This division gives the project coverage at three levels:

  1. Classification logic.
  2. Output rendering.
  3. End-to-end pytest integration.

Design Boundaries

pytest-why is intentionally focused. It does not attempt to understand every exception type or infer the root cause of arbitrary application failures.

The current model has several boundaries:

  • Classification is based on known textual signals in pytest's traceback representation.
  • Multiple failures are reported independently; the package does not currently group duplicate root causes.
  • Reports are written to the current working directory.
  • Output file names are fixed.
  • The classifier does not inspect application source code or runtime variables beyond what appears in the report.
  • Framework-specific context currently focuses on Selenium and Playwright.

These boundaries keep the package predictable, but they also identify clear future improvements:

  • User-defined classification rules.
  • Configurable report paths and formats.
  • Structured JSON output.
  • Grouping repeated failures by fingerprint.
  • Links from report frames to source repositories.
  • Additional guidance for database, API, concurrency, and infrastructure failures.

Any expansion should preserve the current fallback rule: when confidence is low, keep the original traceback central and avoid pretending to know more than the available evidence supports.

When pytest-why Is Most Useful

The plugin is particularly useful in:

  • Large regression suites where several unrelated failures appear together.
  • CI pipelines where developers need a quick triage view before opening the complete log.
  • Onboarding environments where contributors are still learning pytest phases and fixture behavior.
  • Browser automation projects where selectors and timing failures recur.
  • Pull requests where test evidence needs to be shared in a readable format.
  • Support workflows where the person investigating the failure did not run the test locally.

For a single obvious assertion, the explanation may simply confirm what an experienced developer already sees. The value grows when failures cross pytest phases, involve collection, or need to be communicated outside the terminal.

Getting Started

Install the latest release:

python -m pip install -U pytest-why

Run your suite:

python -m pytest --why

After the run, inspect:

pytest-why-report.md
pytest-why-report.html

Project links:

Final Thoughts

Good test diagnostics should reduce the distance between failure and action.

Pytest already provides strong failure evidence. pytest-why builds on that foundation by adding category, context, and a practical first debugging step, while preserving the complete traceback for deeper analysis.

The result is a small workflow change:

pytest --why

But that change turns a test run into something easier to scan, easier to share, and easier to investigate.

Get In Touch

Interested in collaborating or have a question about my projects? Feel free to reach out. I'm always open to discussing new ideas and opportunities.