Selenium Chatbot Test

Selenium Chatbot Test

PythonSeleniumGenAILLMAutomation

The Challenge

Traditional Selenium tests fail on GenAI chatbots because responses are streamed token-by-token (causing partial text capture) and AI outputs are non-deterministic (causing exact string assertions to fail). Teams resort to excessive time.sleep() hacks that slow CI and remain flaky.

The Solution

Built a library with three core modules: StreamWaiter uses browser-native MutationObserver for zero-polling stream completion detection, SemanticAssert leverages sentence-transformers for similarity-based validation, and LatencyMonitor captures TTFT and total response time metrics.

  • βœ“MutationObserver Stream Detection
  • βœ“Semantic Similarity Assertions
  • βœ“TTFT Latency Monitoring
  • βœ“Zero time.sleep() Required
  • βœ“CPU-Optimized ML Models

Selenium Chatbot Test: Case Study

Executive Summary

selenium-chatbot-test is an open-source Python library that solves the fundamental problem of testing Generative AI interfaces with Selenium WebDriver. By replacing polling-based waits with browser-native MutationObserver APIs and substituting exact string assertions with ML-powered semantic similarity, the library eliminates test flakiness inherent to streaming, non-deterministic AI responses. Verified demo shows 71% semantic accuracy with 41.7ms Time-To-First-Token detection across 48 DOM mutations.


Project Context

A Python library extending Selenium WebDriver to reliably test Generative AI interfacesβ€”Chatbots, Copilots, and Streaming UIs. Standard Selenium fails on these interfaces because responses are streamed token-by-token (causing partial text capture) and AI outputs are non-deterministic (causing exact string assertions to fail). The library provides three core modules: StreamWaiter for stream detection, SemanticAssert for similarity-based assertions, and LatencyMonitor for performance metrics.

Key Objectives:

  • Enable reliable E2E testing of streaming chatbot interfaces
  • Eliminate flaky tests caused by partial text capture
  • Support semantic validation of non-deterministic AI responses
  • Provide built-in latency metrics (TTFT, total response time)

Stakeholders/Users:

  • QA Engineers testing LLM-powered applications
  • Developers building chatbot/copilot interfaces
  • CI/CD pipelines requiring stable AI interface tests

Technical Background:

  • Python β‰₯ 3.9, PEP-561 compliant
  • Selenium WebDriver 4.x
  • sentence-transformers with all-MiniLM-L6-v2 model
  • JavaScript MutationObserver API
  • Zero time.sleep() or Python-side polling

Problem

The Original Situation

Traditional Selenium WebDriver tests are designed for static web pages where content loads once and stabilizes. When applied to Generative AI interfaces, these tests face systematic failures:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  USER: "Hello"                                              β”‚
β”‚                                                             β”‚
β”‚  AI Response (Streaming):                                   β”‚
β”‚    t=0ms:    "H"                                            β”‚
β”‚    t=50ms:   "Hello"                                        β”‚
β”‚    t=100ms:  "Hello! How"                                   β”‚
β”‚    t=200ms:  "Hello! How can I"                             β”‚
β”‚    t=500ms:  "Hello! How can I help you today?"  ← FINAL   β”‚
β”‚                                                             β”‚
β”‚  Standard Selenium captures: "Hello! How can I"  ← PARTIAL β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

What Was Broken

  1. Partial Text Capture: WebDriverWait with text_to_be_present triggers on first matching text, missing the complete response
  2. Assertion Failures: assertEqual("Hello! How can I help you?", response) fails when AI responds with equivalent but differently-worded text
  3. No Latency Visibility: No built-in way to measure Time-To-First-Token (TTFT), a critical LLM performance metric

Risks Caused

RiskImpact
Flaky TestsCI pipelines fail randomly, eroding confidence
False NegativesValid AI responses rejected due to exact matching
Missed Performance RegressionsNo TTFT tracking means slow responses go undetected
Developer FrictionTeams resort to excessive time.sleep() hacks

Why Existing Approaches Were Insufficient

ApproachProblem
time.sleep(5)Arbitrary delays; too short = flaky, too long = slow CI
WebDriverWait + text_to_be_presentTriggers on partial text, not completion
Polling with length checksRace conditions; text length can plateau mid-stream
Exact string assertionsFundamentally incompatible with non-deterministic AI

Challenges

Technical Challenges

  1. Stream Completion Detection

    • No DOM event fired when streaming completes
    • Token arrival timing is unpredictable (10ms-500ms gaps)
    • Must distinguish "stream paused" from "stream completed"
  2. Non-Deterministic Validation

    • AI responses vary in wording, punctuation, length
    • Traditional assertion libraries only support exact matching
    • Need semantic understanding, not syntactic comparison
  3. Performance Measurement

    • Browser timestamps required (not Python-side)
    • Must track first mutation separately from subsequent ones
    • Observer cleanup critical to prevent SPA memory leaks

Operational Constraints

ConstraintImpact
CI/CD EnvironmentsNo GPU available; ML models must work on CPU
Test Startup TimeHeavy model loading cannot block test initialization
Memory SafetyLong-running test suites require proper resource cleanup
Browser CompatibilityMust use standard web APIs (no browser-specific hacks)

Hidden Complexities

  1. Lazy Model Loading: sentence-transformers loads 90MB+ models; must defer to first use
  2. Observer Cleanup: JavaScript MutationObserver persists in SPAs; requires explicit disconnect
  3. Locator Abstraction: Must support all Selenium locator types (ID, CSS, XPath, etc.)
  4. Promise-Based JavaScript: Async JS execution with proper timeout handling

Solution

Architecture Overview

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    selenium-chatbot-test                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                             β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”         β”‚
β”‚  β”‚ StreamWaiterβ”‚  β”‚SemanticAssertβ”‚  β”‚LatencyMonitorβ”‚        β”‚
β”‚  β”‚             β”‚  β”‚              β”‚  β”‚              β”‚        β”‚
β”‚  β”‚ Mutation    β”‚  β”‚ sentence-    β”‚  β”‚ performance  β”‚        β”‚
β”‚  β”‚ Observer    β”‚  β”‚ transformers β”‚  β”‚ .now() API   β”‚        β”‚
β”‚  β”‚ (Browser)   β”‚  β”‚ (Python)     β”‚  β”‚ (Browser)    β”‚        β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜        β”‚
β”‚         β”‚                β”‚                 β”‚                β”‚
β”‚         β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                β”‚
β”‚                          β”‚                                  β”‚
β”‚                   WebDriver Protocol                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Step-by-Step Approach

Module 1: StreamWaiter

Design Decision: Use JavaScript MutationObserver instead of Python polling.

Implementation:

# Injected JavaScript (simplified)
observer = new MutationObserver((mutations) => {
    resetSilenceTimer();  # Reset on each mutation
});

silenceTimer = setTimeout(resolve, silenceTimeoutMs);  # Resolve when silent

Algorithm:

  1. Inject MutationObserver via driver.execute_script()
  2. Observe target element for childList and characterData changes
  3. Reset silence timer on each mutation
  4. Resolve Promise only when timer reaches silence_timeout without interruption
  5. Cleanup observer in finally block

Module 2: SemanticAssert

Design Decision: Lazy-load ML model using Singleton pattern.

Implementation:

class _ModelLoader:
    _instance = None
    _models = {}
    
    def get_model(self, model_name):
        if model_name not in self._models:
            self._models[model_name] = self._load_model(model_name)
        return self._models[model_name]

CPU Fallback Logic:

device = "cpu"
if torch.cuda.is_available():
    device = "cuda"
try:
    model = SentenceTransformer(model_name, device=device)
except Exception:
    model = SentenceTransformer(model_name, device="cpu")  # Fallback

Module 3: LatencyMonitor

Design Decision: Context manager pattern with automatic cleanup.

Implementation:

with LatencyMonitor(driver, (By.ID, "chat-box")) as monitor:
    send_button.click()
    # ... wait ...

print(f"TTFT: {monitor.metrics.ttft_ms}ms")

Metrics Captured:

  • ttft_ms: Time from observer start to first mutation
  • total_ms: Time from start to last mutation
  • token_count: Number of mutations observed

Tools & Frameworks Used

ComponentTechnologyPurpose
Stream DetectionJavaScript MutationObserverZero-polling DOM observation
Semantic Similaritysentence-transformersText embedding & cosine similarity
Embedding Modelall-MiniLM-L6-v2Fast, accurate 384-dim embeddings
Latency Trackingperformance.now()Sub-millisecond browser timestamps
Type SafetyPEP-561 py.typedIDE autocompletion & type checking

Outcome/Impact

Quantified Improvements

MetricBeforeAfterImprovement
Stream Detection Accuracy~60% (partial captures)100%+40%
Assertion FlexibilityExact match only70%+ semantic similarityN/A
TTFT VisibilityNot available41.7ms precisionNew capability
Test FlakinessHigh (timing-dependent)EliminatedStable CI
Code RequiredCustom polling loops3 lines of code-80% boilerplate

Verified Demo Results

πŸ“Š DEMO RESULTS
============================================================
πŸ“ Response: Hello! How can I assist you today? 
             I am a helpful AI assistant ready to answer your questions.

⏱️  TTFT (Time-To-First-Token): 41.7ms
⏱️  Total Latency: 2434.8ms
πŸ“ˆ Mutation Count: 48

🎯 Semantic Similarity Score: 71.38%
βœ… Semantic assertion PASSED!
============================================================

Long-Term Benefits

  1. CI/CD Stability: Deterministic test outcomes for AI interfaces
  2. Performance Monitoring: TTFT tracking enables LLM regression detection
  3. Developer Productivity: Simple API reduces test authoring time
  4. GPU-Optional: Works on any CI runner without CUDA dependencies

Summary

selenium-chatbot-test addresses the fundamental incompatibility between traditional Selenium testing and modern Generative AI interfaces. By leveraging browser-native MutationObserver for stream detection, sentence-transformers for semantic validation, and performance.now() for latency measurement, the library provides a complete solution for testing chatbots, copilots, and streaming UIs. The implementation prioritizes CI/CD compatibility through lazy model loading, CPU fallback, and automatic resource cleanup, delivering reliable test execution with minimal configuration.


Get In Touch

Interested in collaborating or have a question about my projects? Feel free to reach out. I'm always open to discussing new ideas and opportunities.