Vision-Based Automation
- The Challenge: No DOM in Citrix/WVD
- The Tech: Echolocation for GUIs (TAB + Diff)
- The Benefit: Free, open-source alternative to UiPath
- Architecture: Python + OpenCV + PyAutoGUI
Every automation engineer has faced this: you need to automate a legacy Windows application running inside Windows Virtual Desktop (WVD) or Citrix. You fire up Selenium... and realize there's no DOM. You try UiPath... and hit a $10,000 licensing wall. You consider AutoIt... and discover the coordinates break when someone changes their monitor.
The Problem: The Automation Black Hole
The challenge: How do you reliably identify form fields when all you have is a live video stream of a remote desktop?
Why Traditional Tools Fail
| Tool | Problem with Remote Desktop |
|------|----------------------------|
| **Selenium/Playwright** | No DOM access; it's a pixel-rendered stream |
| **UiPath/Blue Prism** | Expensive licenses, complex setup |
| **AutoIt/PyAutoGUI (basic)** | Hardcoded coordinates break with resolution/DPI changes |
| **Image Recognition** | Brittle; fails with font smoothing, themes, colors |

The Solution: Echolocation for GUIs
Visual Sonar borrows from nature. Bats navigate in darkness using echolocation: they emit sounds and detect objects by analyzing the returning echoes. Visual Sonar does the same with form fields:
┌──────────────────────────────────────────────────────┐
│ 1. Take screenshot (BEFORE)                          │
│ 2. Press TAB key                                     │
│ 3. Take screenshot (AFTER)                           │
│ 4. Diff the images → The changed region = FIELD      │
└──────────────────────────────────────────────────────┘

When a form field receives focus, most applications display a focus ring (blue border, highlight, etc.). By detecting *where the pixels changed*, we identify the field's exact coordinates.
Why This Works
- Universal: Every accessible application shows focus indicators
- Resolution-Independent: Coordinates are captured at runtime
- Theme-Agnostic: We detect *change*, not specific colors
- No Agents Required: Works purely from the client side
Architecture Deep Dive
Core Components
visual_sonar.py (985 lines)
├── VisualSonar class
│   ├── _detect_dpi_scale()       # Per-monitor DPI handling
│   ├── wait_for_stabilization()  # Adaptive frame settling
│   ├── get_focus_region()        # Computer vision diff
│   ├── map_form()                # Interactive mapper
│   ├── run_automation()          # Execution engine
│   └── extract_text_*()          # OCR capabilities
├── _run_cli()                    # CLI command handler
└── wvd_simulator.py              # Safe testing environment

The Detection Algorithm
def get_focus_region(self, img_before, img_after):
    # Convert to grayscale
    gray_a = cv2.cvtColor(np.array(img_before), cv2.COLOR_RGB2GRAY)
    gray_b = cv2.cvtColor(np.array(img_after), cv2.COLOR_RGB2GRAY)

    # Gaussian blur to reduce noise
    gray_a = cv2.GaussianBlur(gray_a, (5, 5), 0)
    gray_b = cv2.GaussianBlur(gray_b, (5, 5), 0)

    # Absolute difference
    diff = cv2.absdiff(gray_a, gray_b)

    # Dynamic thresholding (median + 15 to ride above compression artifacts)
    median_val = np.median(diff)
    thresh_val = int(median_val + 15)
    _, thresh = cv2.threshold(diff, thresh_val, 255, cv2.THRESH_BINARY)

    # Morphological dilation to connect nearby changes
    kernel = np.ones((5, 5), np.uint8)
    dilated = cv2.dilate(thresh, kernel, iterations=2)

    # Find the largest contour = the field
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # No visible change
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)  # (x, y, width, height)

Key techniques: Gaussian Blur reduces noise from RDP compression artifacts. Dynamic Thresholding uses median + 15 instead of a fixed value to adapt to different compression levels. Morphological Dilation connects fragmented focus ring pixels into a single region. Contour Analysis finds the largest changed area (the field) and ignores small cursor flickers.
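At runtime, the returned rectangle becomes a click or type target. A minimal sketch of that step (the helper name is hypothetical, not part of the actual API):

```python
def field_center(rect, dpi_scale=1.0):
    """Center point of a detected focus region, scaled to the runtime DPI."""
    x, y, w, h = rect
    return (int((x + w / 2) * dpi_scale), int((y + h / 2) * dpi_scale))

# e.g. pyautogui.click(*field_center((100, 200, 80, 24)))  # clicks (140, 212)
```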
Challenges & Solutions
Challenge 1: Screen Instability (RDP Lag)
Problem: Remote desktop connections have latency. Taking screenshots immediately after TAB may capture mid-render frames.
Solution: Adaptive Stabilization. The system waits for 3 consecutive frames with less than 20 pixels of change. This handles cursor blink animations, RDP decoder artifacts, and network jitter.
def wait_for_stabilization(self):
    """Wait for N consecutive near-identical frames."""
    last = screenshot()
    stable_count = 0
    while stable_count < 3:  # Need 3 stable frames
        time.sleep(0.1)
        curr = screenshot()
        diff = cv2.absdiff(last, curr)
        if cv2.countNonZero(diff) < 20:  # Micro-motion limit
            stable_count += 1
        else:
            stable_count = 0  # Reset on movement
        last = curr

Challenge 2: Multi-Monitor DPI Scaling
Problem: A map created on a 4K monitor (150% DPI) fails on a 1080p monitor (100% DPI).
Solution: Per-Window DPI Detection. Get the DPI from the actual RDP window, not the system default. The map stores the DPI scale, and at runtime the scale ratio is applied.
def _detect_dpi_scale(self):
    try:
        # Get DPI from the actual RDP window, not system default
        rdp_windows = gw.getWindowsWithTitle('Remote Desktop')
        if rdp_windows:
            hwnd = rdp_windows[0]._hWnd
            dpi = ctypes.windll.user32.GetDpiForWindow(hwnd)
            return dpi / 96.0
    except Exception:
        pass
    # Fallback to system DPI (also used when no RDP window is found)
    return ctypes.windll.user32.GetDpiForSystem() / 96.0

Challenge 3: Infinite Loop Detection
Problem: If a field has no visible focus ring (e.g., disabled element), the TAB loop never ends.
Solution: Three-Strikes Guard. Three consecutive TABs without visual change = end of form.
no_change_count = 0
while True:
    res = get_focus_region(before, after)
    if res:
        no_change_count = 0  # Reset on success
        # ... map the field
    else:
        no_change_count += 1
        if no_change_count >= 3:
            print("No change for 3 TABs → assume end of form.")
            break

Challenge 4: CI/CD Headless Execution
Problem: `cv2.imshow()` crashes on headless servers (GitHub Actions, Docker).
Solution: Environment Variable Guard. Set `HEADLESS=1` to skip all GUI operations.
if os.environ.get("HEADLESS", "0") == "0":
    cv2.imshow("Field Detected", debug_img)
    cv2.waitKey(1)

Challenge 5: Unicode & Special Characters
Problem: `pyautogui.write()` doesn't support Unicode characters (é, ñ, 中文).
Solution: Clipboard Paste. This handles any character the clipboard supports.
import pyperclip
import pyautogui

def type_text(value):
    pyautogui.hotkey('ctrl', 'a')  # Select all
    pyautogui.press('backspace')   # Clear
    pyperclip.copy(str(value))     # Copy to clipboard
    pyautogui.hotkey('ctrl', 'v')  # Paste

Challenge 6: Secrets in Logs
Problem: Passwords appearing in console output is a security risk.
Solution: Pre-Execution Scrubbing. Secrets are never logged.
safe_data = copy.deepcopy(input_data)
for k, v in safe_data.items():
    if isinstance(v, str) and len(v) > 3:
        safe_data[k] = '*' * len(v)  # "password" → "********"
print(f"Input: {safe_data}")  # Secrets never logged

Feature Matrix
| Feature | Description | Status |
|---------|-------------|--------|
| **Visual Mapping** | TAB-based focus detection | Stable |
| **Field Types** | text, click, toggle, dropdown, double_click | Stable |
| **DPI Scaling** | Per-monitor awareness | Stable |
| **OCR Extraction** | EasyOCR / Pytesseract backends | Stable |
| **Batch Testing** | CSV/JSON data-driven runs | Stable |
| **Emergency Stop** | Mouse-to-corner failsafe | Stable |
| **Step Screenshots** | Debug capture per action | Stable |
| **Headless Mode** | CI/CD compatible | Stable |
| **Encrypted Input** | Fernet-based secrets | Planned |
| **Cross-Platform** | Linux/Mac support | Planned |

Usage Examples
Basic Workflow
# 1. Install
pip install visual-sonar

# 2. Map your form
visual-sonar map

# 3. Create input file
echo '{"username": "test@co.com", "password": "secret", "signin": true}' > input.json

# 4. Execute
visual-sonar run

Data-Driven Testing
# users.csv
username,password,signin
user1@co.com,pass1,true
user2@co.com,pass2,true
user3@co.com,pass3,true

# Run batch
visual-sonar batch users.csv

Runs the same form 3 times with different credentials.
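Under the hood, a batch run presumably boils down to iterating the CSV rows and replaying the map once per row. A sketch of that loop (the `run_automation` callback stands in for the real execution engine; the actual CLI also scrubs secrets and captures step screenshots):

```python
import csv

def run_batch(csv_path, run_automation):
    """Replay the mapped form once per CSV row; returns the run count."""
    count = 0
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            # "true"/"false" strings become booleans for toggle/click fields
            data = {k: v.lower() == 'true' if v.lower() in ('true', 'false') else v
                    for k, v in row.items()}
            run_automation(data)
            count += 1
    return count
```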
OCR Verification
# Extract text from screen region
visual-sonar extract region 100 200 300 50

# Verify expected text exists
visual-sonar extract verify 100 200 300 50 "Login Successful"

Benefits vs. Drawbacks
Benefits
- Zero Licensing Fees: MIT licensed, completely free
- No Agent Installation: Works client-side only
- Works Everywhere: Any app with focus indicators
- Portable Maps: JSON files version-control easily
- Data-Driven: Batch testing built-in
- DPI-Aware: Handles multi-monitor setups
- Safe Testing: Built-in simulator for development
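For illustration, a portable map could look something like the JSON below. The schema and field names here are hypothetical; check the files that `visual-sonar map` actually produces:

```json
{
  "dpi_scale": 1.5,
  "fields": [
    { "name": "username", "type": "text",  "x": 612, "y": 340, "w": 220, "h": 28 },
    { "name": "password", "type": "text",  "x": 612, "y": 392, "w": 220, "h": 28 },
    { "name": "signin",   "type": "click", "x": 700, "y": 450, "w": 90,  "h": 32 }
  ]
}
```

Because it is plain JSON, diffs in version control show exactly which fields moved between mappings.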
Drawbacks
- Focus Ring Required: Apps without visible focus indicators won't work
- Sequential Only: Can't interact with multiple windows simultaneously
- Network Dependent: RDP lag affects detection reliability
- Windows-Only: Currently requires Windows APIs (pygetwindow, winsound)
- Resolution Sensitivity: Significant resolution changes require remapping
- No Conditional Logic: Linear execution only (no if/else branching)
Security Considerations
- Never commit `input.json`: it contains credentials, so always add it to `.gitignore`
- Use encrypted input (planned): Fernet-encrypted secrets file
- Restrict screen sharing: Bot screencaps may expose data
- Network isolation: Run on dedicated automation networks
- Audit trails: All step screenshots provide forensic evidence
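The planned encrypted-input feature could build on the standard `cryptography` Fernet API. A sketch of what that workflow might look like (this is not shipped behavior; the helper names are illustrative):

```python
import json
from cryptography.fernet import Fernet

def encrypt_input(data: dict, key: bytes) -> bytes:
    """Serialize and encrypt an input payload for storage on disk."""
    return Fernet(key).encrypt(json.dumps(data).encode())

def decrypt_input(token: bytes, key: bytes) -> dict:
    """Decrypt and deserialize the payload at run time."""
    return json.loads(Fernet(key).decrypt(token))

# Typical flow: generate the key once and keep it out of the repo
key = Fernet.generate_key()
token = encrypt_input({"username": "test@co.com", "password": "secret"}, key)
```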
Real-World Use Cases
- Legacy ERP Automation: Automate data entry into SAP, Oracle, or custom legacy systems running inside Citrix.
- Test Automation: Functional testing of desktop applications where traditional tools fail.
- Data Migration: Bulk data entry from CSV into systems without APIs.
- Compliance Reporting: Scheduled screenshot capture of dashboards for audit trails.
- Training Data Generation: Capture labeled form field data for ML model training.
Future Roadmap
- Encrypted Secrets: Fernet-based input.json encryption
- Linux/Mac Support: Cross-platform pygetwindow alternatives
- Conditional Flows: If/else branching based on OCR results
- Parallel Execution: Multi-window orchestration
- Visual Diff Dashboard: Web UI for comparing runs
- Auto-Healing: Fuzzy coordinate matching for minor UI changes
Getting Started
pip install visual-sonar
visual-sonar help

Documentation: USER_GUIDE.md
Source: github.com/godhiraj-code/wvdautomation
Acknowledgments
- OpenCV for computer vision primitives
- PyAutoGUI for cross-platform input simulation
- EasyOCR for neural network-based text extraction
- The countless hours spent debugging RDP compression artifacts
*Visual Sonar is released under the MIT License. Contributions welcome!*

