pytest-mockllm

Python · PyPI · LLM · pytest · Testing · AI

The Challenge

Testing LLM integrations against live APIs is expensive, slow, and flaky. Async support in existing mocks is unreliable, token counting is inaccurate, and VCR-style recordings risk leaking API keys. Teams cannot write reliable unit tests for AI features without hitting real APIs.

The Solution

Built a pytest plugin with native async coroutines for OpenAI, Anthropic, Gemini, and LangChain. Integrated tiktoken for >99% token accuracy, implemented automatic PII redaction before cassette storage, and added chaos tools for simulating rate limits and network jitter.

  • ✓ True Async & Await
  • ✓ Pro Tokenizers (tiktoken)
  • ✓ PII Redaction
  • ✓ Chaos Engineering (Rate Limits, Timeouts)
  • ✓ ROI Dashboard
  • ✓ Python 3.14 Support
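
To make the chaos-engineering idea concrete, here is a minimal sketch of a mock that fails with a simulated rate limit for its first few calls and adds random latency to every call. The names ChaosClient, SimulatedRateLimitError, and complete_with_retry are hypothetical and are not taken from the plugin's API; they only illustrate the kind of behavior such a tool could inject.

```python
import asyncio
import random


class SimulatedRateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


class ChaosClient:
    """Hypothetical mock: fails the first `fail_first` calls, adds jitter to all."""

    def __init__(self, reply: str, fail_first: int = 2,
                 max_jitter_s: float = 0.01, seed: int = 0) -> None:
        self._reply = reply
        self._remaining_failures = fail_first
        self._max_jitter_s = max_jitter_s
        self._rng = random.Random(seed)  # seeded so test runs are repeatable
        self.calls = 0

    async def complete(self, prompt: str) -> str:
        self.calls += 1
        # Simulated network jitter: a short random delay before responding.
        await asyncio.sleep(self._rng.uniform(0, self._max_jitter_s))
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            raise SimulatedRateLimitError("429: rate limit exceeded (simulated)")
        return self._reply


async def complete_with_retry(client: ChaosClient, prompt: str,
                              max_attempts: int = 3) -> str:
    """Code under test: retry on rate limits, re-raise on the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await client.complete(prompt)
        except SimulatedRateLimitError:
            if attempt == max_attempts:
                raise
    raise RuntimeError("max_attempts must be at least 1")
```

A test can then assert both the result and the number of attempts, for example that `asyncio.run(complete_with_retry(ChaosClient("ok", fail_first=2), "hi"))` returns "ok" after exactly three calls.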

Case Study: Engineering "True Fidelity" in pytest-mockllm v0.2.1

The Challenge

When we first released pytest-mockllm, our async support was a "best-effort" wrapper around synchronous mocks. While this worked for simple cases, it failed in production-grade environments where developers used:

  • Complex coroutine orchestration
  • Asynchronous generators for streaming
  • Strict type checking (mypy)
  • LangChain's astream and ainvoke interfaces

Additionally, users were concerned about the security of VCR-style recordings in enterprise environments, where API keys could accidentally leak into git history.

The Solution: v0.2.1 "True Fidelity"

1. Re-engineering Async Core

We moved away from simple MagicMock wrappers. In v0.2.1, every provider mock (OpenAI, Anthropic, Gemini) now implements native async def methods that return real coroutines. This ensures that await calls behave exactly as they do with real SDKs.

For streaming, we implemented custom AsyncIterator classes that mimic the SSE (Server-Sent Events) behavior of LLM providers.
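
The difference is easiest to see in a small sketch. The classes below are hypothetical (FakeChatClient and FakeStream are illustrative names, not the plugin's real classes) and assume nothing beyond the standard library. They show a mock whose method is declared with async def, so calling it yields a real coroutine, and a streaming response implemented as a hand-written async iterator.

```python
import asyncio
from typing import List


class FakeStream:
    """Hypothetical async iterator that yields text chunks one at a time."""

    def __init__(self, chunks: List[str]) -> None:
        self._chunks = list(chunks)
        self._index = 0

    def __aiter__(self) -> "FakeStream":
        return self

    async def __anext__(self) -> str:
        if self._index >= len(self._chunks):
            raise StopAsyncIteration
        # Hand control back to the event loop, as a real network read would.
        await asyncio.sleep(0)
        chunk = self._chunks[self._index]
        self._index += 1
        return chunk


class FakeChatClient:
    """Hypothetical provider mock with a native async method."""

    def __init__(self, reply: str) -> None:
        self._reply = reply

    async def complete(self, prompt: str) -> str:
        # Declared with `async def`, so calling it returns a real coroutine.
        await asyncio.sleep(0)
        return self._reply

    def stream(self, prompt: str) -> FakeStream:
        # Split the canned reply into word-sized chunks for streaming.
        return FakeStream(self._reply.split(" "))


async def collect(client: FakeChatClient, prompt: str) -> List[str]:
    """Consume the stream with `async for`, as application code would."""
    return [chunk async for chunk in client.stream(prompt)]
```

Because both paths go through the event loop, code that forgets an await or iterates with a plain for loop fails against this mock the same way it would against a real async client.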

2. Market-Leading Tokenomics

Standard character-based token estimation is often off by 20-30%. By integrating tiktoken (OpenAI) and custom heuristics (Anthropic), we brought our accuracy to >99% for standard models. This allows developers to write precise assertions on usage and cost.
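
As a rough sketch of the two approaches, the snippet below contrasts a character-based estimate with a tokenizer-backed count. It assumes the third-party tiktoken package and its cl100k_base encoding are available, and falls back to the estimate otherwise. The helper names and the per-1k prices used with cost_usd are hypothetical placeholders, not the plugin's API or real pricing.

```python
from typing import Callable


def estimate_tokens_by_chars(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate using the common ~4 characters per token rule of thumb."""
    if not text:
        return 0
    return max(1, round(len(text) / chars_per_token))


def make_token_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Return a tiktoken-backed counter when possible, else the rough estimate."""
    try:
        import tiktoken  # third-party and optional in this sketch

        encoding = tiktoken.get_encoding(encoding_name)
    except Exception:
        # tiktoken is missing, or the requested encoding could not be loaded.
        return estimate_tokens_by_chars
    return lambda text: len(encoding.encode(text))


def cost_usd(prompt_tokens: int, completion_tokens: int,
             prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Cost from token counts; prices are caller-supplied placeholders."""
    return (prompt_tokens / 1000) * prompt_price_per_1k + \
           (completion_tokens / 1000) * completion_price_per_1k
```

With an exact counter, a test can assert on a specific token count or cost; with only the character estimate, the same assertion would need a wide tolerance.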

3. PII Redaction by Default

Security should never be an afterthought. We implemented a PIIRedactor that automatically scrubs:

  • api_key and sk-... strings
  • Authorization: Bearer ... headers
  • Sensitive parameters in request bodies

This redaction happens before the cassette is ever written to disk, so matched secrets never reach the file or git history.
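
A minimal sketch of pattern-based scrubbing, assuming the cassette is serialized as JSON text. The regular expressions and the redact and serialize_cassette helpers are illustrative assumptions, not the actual PIIRedactor implementation, and a real redactor would likely cover more cases.

```python
import json
import re

REDACTED = "[REDACTED]"

# (pattern, replacement) pairs; group 1 keeps the key or header prefix intact.
_PATTERNS = [
    # Authorization: Bearer <token>, as a raw header or as a JSON key/value
    (re.compile(r'(?i)(authorization"?\s*:\s*"?bearer\s+)[^"\s]+'), r"\1" + REDACTED),
    # api_key values in JSON bodies or query strings
    (re.compile(r'(?i)("?api_key"?\s*[:=]\s*"?)[^"\s,&}]+'), r"\1" + REDACTED),
    # bare sk-... strings anywhere else in the text
    (re.compile(r"\bsk-[A-Za-z0-9_-]{8,}"), REDACTED),
]


def redact(text: str) -> str:
    """Apply every pattern in order and return the scrubbed text."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


def serialize_cassette(interaction: dict) -> str:
    """Scrub the serialized interaction; only this output should reach disk."""
    return redact(json.dumps(interaction))
```

Running the redaction on the serialized text, rather than on individual fields, means a secret is caught wherever it appears in the recorded request or response.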

Outcomes

  • Zero Flakiness: True async support eliminated TypeError and "coroutine was never awaited" failures in CI.
  • Enterprise Ready: Secure recording allows teams to share cassettes without security risk.
  • Future Proof: Full verification against Python 3.14 ensures the library is ready for the next decade of AI development.

Built with passion for the AI testing community.
