What You'll Learn
- The Problem: Why WebDriverWait fails on streaming responses
- MutationObserver: Zero-polling stream detection in the browser
- Semantic Assertions: ML-powered validation for non-deterministic outputs
- TTFT Monitoring: Measuring Time-To-First-Token for LLM performance
You've built an automation suite for your new AI chatbot. The tests run. Then they fail. Randomly. The response was correct (you can see it on the screen), but your assertion says otherwise. Welcome to the nightmare of testing Generative AI interfaces with traditional Selenium.
The Fundamental Incompatibility
Traditional Selenium WebDriver tests are designed for static web pages where content loads once and stabilizes. AI chatbots break this assumption in two fundamental ways:
- Streaming Responses: Tokens arrive one-by-one over 2-5 seconds. Your `WebDriverWait` triggers on the first token, capturing partial text.
- Non-Deterministic Output: The same question yields different (but equivalent) answers. `assertEqual()` fails even when the response is correct.
User: "Hello"
AI Response (Streaming):
t=0ms: "H"
t=50ms: "Hello"
t=100ms: "Hello! How"
t=200ms: "Hello! How can I"
t=500ms: "Hello! How can I help you today?" â FINAL
Standard Selenium captures: "Hello! How can I" â PARTIAL (FAIL!)The Usual Hacks (And Why They Fail)
Every team tries the same workarounds (a typical failing version is sketched after this list):
- `time.sleep(5)`: Arbitrary. Too short = flaky. Too long = slow CI. Never works reliably.
- `text_to_be_present`: Triggers on first match, missing the complete response.
- Polling with length checks: Race conditions. Text length can plateau mid-stream.
- Exact string assertions: Fundamentally impossible with non-deterministic AI.
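Here's what those hacks usually look like combined into one flaky test. This is a sketch of the naive approach, not code from the library; the selectors and the expected string are placeholders matching the examples in this post.

```python
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://my-chatbot.com")

driver.find_element(By.ID, "chat-input").send_keys("Hello!")
driver.find_element(By.ID, "send-btn").click()

# Hack 1: arbitrary sleep. Too short on a slow day, too long everywhere else.
time.sleep(5)

# Hack 2: text_to_be_present. Returns as soon as "Hello" appears,
# even though the stream is still going.
WebDriverWait(driver, 10).until(
    EC.text_to_be_present_in_element((By.ID, "chat-response"), "Hello")
)

# Hack 3: exact-match assertion. Fails whenever the model rephrases.
actual = driver.find_element(By.ID, "chat-response").text
assert actual == "Hello! How can I help you today?"  # flaky by design
```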
The Solution: Browser-Native Stream Detection
The key insight is that the browser already knows when streaming stops; we just need to listen. The MutationObserver API watches for DOM changes in real-time, directly in JavaScript. No Python polling. No arbitrary sleeps.
```python
from selenium.webdriver.common.by import By

from selenium_chatbot_test import StreamWaiter

# Wait for the AI response to complete streaming
waiter = StreamWaiter(driver, (By.ID, "chat-response"))
response_text = waiter.wait_for_stable_text(
    silence_timeout=500,    # consider "done" after 500ms of no changes
    overall_timeout=30000,  # maximum wait time
)
```

Under the hood, `StreamWaiter` injects a MutationObserver that resets a timer on every DOM mutation. Only when the timer reaches `silence_timeout` without interruption does it return, guaranteeing you capture the complete response.
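What could that injected script look like? Here's a minimal sketch of the debounce pattern built on `execute_async_script`. It illustrates the technique, not `StreamWaiter`'s actual source; the `#chat-response` selector and timeouts are carried over from the example above.

```python
# Illustrative MutationObserver debounce; not the library's real internals.
STABLE_TEXT_JS = """
const [selector, silenceMs, done] = arguments;
const el = document.querySelector(selector);
let finished = false;
let timer = null;

const finish = () => {
    if (finished) return;
    finished = true;
    observer.disconnect();
    done(el.textContent);  // resolve back to Python with the final text
};

const observer = new MutationObserver(() => {
    clearTimeout(timer);                    // every mutation resets the timer,
    timer = setTimeout(finish, silenceMs);  // so it only fires after silence
});

observer.observe(el, { childList: true, subtree: true, characterData: true });
timer = setTimeout(finish, silenceMs);  // covers a stream that already ended
"""

driver.set_script_timeout(35)  # hard cap; must exceed the longest expected wait
response_text = driver.execute_async_script(STABLE_TEXT_JS, "#chat-response", 500)
```

The point is that the "am I done yet?" decision happens inside the browser's event loop, synchronous with the mutations themselves, rather than in a Python polling loop that can race the stream.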
Semantic Assertions: Testing Meaning, Not Words
Once you have the full response, you face the second problem: AI outputs vary. The solution is semantic similarity: comparing meaning instead of exact strings.
```python
from selenium_chatbot_test import SemanticAssert

asserter = SemanticAssert()

# These all mean the same thing, and this assertion passes!
expected = "Hello! How can I help you today?"
actual = "Hi there! What can I assist you with?"

asserter.assert_similar(
    expected,
    actual,
    threshold=0.7,  # 70% semantic similarity required
)
# ✅ PASSES - because they mean the same thing
```

The library uses `sentence-transformers` with the `all-MiniLM-L6-v2` model to generate embeddings and calculate cosine similarity. The model is lazy-loaded on first use and works on CPU; no GPU required in CI.
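If you're curious what that amounts to, here's the same check done directly with `sentence-transformers`. `SemanticAssert`'s internals may differ in detail, but embed-then-cosine is the standard recipe:

```python
from sentence_transformers import SentenceTransformer, util

# Same model the library uses; downloaded from the Hub on first use.
model = SentenceTransformer("all-MiniLM-L6-v2")

expected = "Hello! How can I help you today?"
actual = "Hi there! What can I assist you with?"

# Encode both sentences into dense vectors, then compare their directions.
embeddings = model.encode([expected, actual])
score = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"cosine similarity: {score:.3f}")
assert score >= 0.7, f"Semantic similarity {score:.3f} below 0.7 threshold"
```

The threshold is a judgment call: 0.7 tolerates rephrasings while still failing on responses that are actually about something else. Tune it against a few known-good and known-bad transcripts before trusting it in CI.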
TTFT: The LLM Performance Metric You're Not Tracking
Time-To-First-Token (TTFT) is critical for user experience. A chatbot that takes 3 seconds to start responding feels broken, even if the total response time is acceptable. Most teams have zero visibility into this metric.
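To make the metric concrete, here's a hand-rolled sketch of the idea: arm a MutationObserver before clicking send, record `performance.now()` at the first mutation, and read the difference back from Python. The element and button IDs are the same placeholders used elsewhere in this post, and this is an illustration, not `LatencyMonitor`'s actual implementation.

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

# Hypothetical hand-rolled TTFT measurement (illustrative only).
ARM_OBSERVER_JS = """
const el = document.querySelector(arguments[0]);
window.__ttft = null;
window.__t0 = performance.now();
const obs = new MutationObserver(() => {
    window.__ttft = performance.now() - window.__t0;  // first-token latency
    obs.disconnect();
});
obs.observe(el, { childList: true, subtree: true, characterData: true });
"""

driver.execute_script(ARM_OBSERVER_JS, "#chat-response")  # arm the observer first...
driver.find_element(By.ID, "send-btn").click()            # ...then trigger the request

# The observer fires in the page, so poll for the recorded value from Python.
ttft_ms = WebDriverWait(driver, 10).until(
    lambda d: d.execute_script("return window.__ttft;")
)
print(f"TTFT: {ttft_ms:.1f}ms")
```

The library packages this pattern as a context manager: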
```python
from selenium.webdriver.common.by import By

from selenium_chatbot_test import LatencyMonitor

send_button = driver.find_element(By.ID, "send-btn")

with LatencyMonitor(driver, (By.ID, "chat-response")) as monitor:
    send_button.click()
    # ... wait for response ...

print(f"TTFT: {monitor.metrics.ttft_ms}ms")      # 41.7ms
print(f"Total: {monitor.metrics.total_ms}ms")    # 2434.8ms
print(f"Tokens: {monitor.metrics.token_count}")  # 48 mutations
```

Putting It All Together
Here's a complete test that would be impossible with traditional Selenium:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

from selenium_chatbot_test import StreamWaiter, SemanticAssert, LatencyMonitor

def test_chatbot_greeting():
    driver = webdriver.Chrome()
    driver.get("https://my-chatbot.com")

    # Type a message
    input_box = driver.find_element(By.ID, "chat-input")
    input_box.send_keys("Hello!")

    # Monitor latency while waiting for response
    with LatencyMonitor(driver, (By.ID, "response")) as monitor:
        driver.find_element(By.ID, "send-btn").click()

        # Wait for streaming to complete (no time.sleep!)
        waiter = StreamWaiter(driver, (By.ID, "response"))
        response = waiter.wait_for_stable_text(silence_timeout=500)

    # Assert semantic meaning, not exact words
    asserter = SemanticAssert()
    asserter.assert_similar(
        "Hello! How can I help you today?",
        response,
        threshold=0.7,
    )

    # Verify performance SLA
    assert monitor.metrics.ttft_ms < 200, "TTFT exceeded 200ms SLA"

    driver.quit()
```

Get Started
Stop fighting flaky AI tests. Start testing semantically.
```bash
pip install selenium-chatbot-test
```
