Visual Sonar

Python · Computer Vision · RPA · PyPI · Automation

The Challenge

Organizations running enterprise applications inside WVD or Citrix face a fundamental automation barrier: the remote desktop renders applications as a pixel stream with no DOM access, making Selenium/Playwright useless. Manual testing becomes a bottleneck.

The Solution

Developed a Python tool that uses computer vision to detect form field focus changes, like echolocation for GUIs. It captures before/after screenshots when pressing TAB and identifies fields by detecting where the focus ring appeared.

  • ✓ TAB-Based Focus Detection
  • ✓ Per-Monitor DPI Scaling
  • ✓ Data-Driven Batch Testing
  • ✓ OCR Extraction
  • ✓ CI/CD Headless Mode

Visual Sonar: Automating Remote Desktop GUI Testing Without DOM Access

A Case Study in Computer Vision-Based RPA for WVD/Citrix Environments


Executive Summary

Visual Sonar is an open-source Python tool that solves the critical challenge of automating GUI testing within Windows Virtual Desktop (WVD) and Citrix environments where traditional automation frameworks fail due to the absence of DOM access. By employing computer vision techniques to detect form field focus changes, the tool enables reliable, DPI-aware, and data-driven automation without expensive RPA licenses or server-side agents. The solution reduced manual testing effort by approximately 85% while providing portable, version-controlled test assets that adapt to resolution and DPI changes at runtime.


Problem

Original Situation

Organizations running enterprise applications inside Windows Virtual Desktop (WVD) or Citrix remote sessions face a fundamental automation barrier: the remote desktop client renders the application as a pixel stream, not a structured document. This means:

  • No DOM Access: Unlike web applications where Selenium can query HTML elements, remote desktop applications are opaque pixel buffers
  • No Object Model: The client cannot distinguish between a text field and a button; to the client, they're all just pixels
  • Locked Environment: Installing automation agents on the remote server is often prohibited by security policies

What Was Broken or Inefficient

Issue | Impact
Manual Testing | QA teams spent 4-6 hours daily on repetitive login and data entry testing
Hardcoded Coordinates | PyAutoGUI scripts broke constantly when monitors changed or DPI scaled differently
No Batch Testing | Running the same test with 50 user accounts required 50 manual executions
Zero Traceability | No screenshots, no logs; failures were unreproducible
Knowledge Silos | Only the script author knew which pixel (450, 320) meant "username field"

Risks Caused

  1. Regression Blind Spots: Without automation, regression testing was inconsistent and incomplete
  2. Deployment Delays: Manual testing became a bottleneck in release cycles
  3. Credential Exposure: Passwords were hardcoded in scripts without scrubbing from logs
  4. Compliance Gaps: No audit trail for test executions

Why Existing Approaches Were Insufficient

Approach | Why It Failed
Selenium/Playwright | Requires HTML DOM access, which is impossible within an RDP pixel stream
UiPath/Blue Prism | $10,000+ annual licensing; complex server infrastructure required
Basic PyAutoGUI | Hardcoded (x, y) coordinates break with any resolution/DPI change
Image Template Matching | Fragile; fails with font smoothing, themes, and compression artifacts
AutoIt | Windows-only with no Python integration; poor maintainability

Challenges

Technical Challenges

1. Detecting Form Fields Without DOM

  • No HTML, no accessibility tree; only raw pixels
  • Must identify interactive elements purely from visual cues
  • Different applications have different focus ring styles (blue borders, highlights, underlines)

2. RDP Compression Artifacts

  • Remote desktop protocols compress video, introducing pixel noise
  • Screenshots taken milliseconds apart may differ due to compression, not actual changes
  • Cursor blink animations cause false positives in change detection

3. Multi-Monitor DPI Scaling

  • Windows allows per-monitor DPI (e.g., 4K at 150%, 1080p at 100%)
  • A map created on one monitor fails when automation runs on another
  • The RDP window may span monitors with different DPI settings
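Coordinates recorded at one DPI scale can be replayed at another by rescaling them with the ratio of the two scale factors. A minimal sketch of the idea; `rescale_point` is a hypothetical helper name, not part of the published API:

```python
def rescale_point(x, y, recorded_scale, current_scale):
    """Convert a coordinate recorded at one DPI scale to the current scale.

    A click mapped on a 4K monitor at 150% (scale 1.5) must be shrunk
    before replaying on a 100% (scale 1.0) display, and vice versa.
    """
    factor = current_scale / recorded_scale
    return round(x * factor), round(y * factor)
```

For example, a point captured at 150% scaling is moved two thirds of the way toward the origin when replayed at 100%.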

4. Screen Instability (Network Latency)

  • RDP connections have variable latency (50-500ms)
  • Taking a screenshot immediately after pressing TAB captures mid-render frames
  • Animation effects (dropdowns expanding, modals fading) cause detection failures
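One way to avoid capturing mid-render frames is to poll until several consecutive screenshots match. The sketch below illustrates that stabilizer idea under stated assumptions: `take_screenshot` and `frames_equal` are injected callables (hypothetical names), and the comparison would typically be a thresholded `cv2.absdiff` in practice:

```python
import time

def wait_for_stable_screen(take_screenshot, frames_equal,
                           required=3, timeout=5.0, interval=0.05):
    """Poll screenshots until `required` consecutive frames match.

    Guards against grabbing a half-painted frame over a laggy RDP link.
    Returns the settled frame, or None if the screen never stabilized.
    """
    deadline = time.monotonic() + timeout
    prev = take_screenshot()
    streak = 1
    while streak < required:
        if time.monotonic() > deadline:
            return None                      # screen never settled
        time.sleep(interval)
        cur = take_screenshot()
        # Extend the streak on a match, otherwise start counting again
        streak = streak + 1 if frames_equal(prev, cur) else 1
        prev = cur
    return prev
```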

Operational Challenges

5. Headless CI/CD Execution

  • OpenCV's cv2.imshow() crashes without a display
  • GitHub Actions and Docker containers have no GUI
  • Debug visualizations must be conditionally disabled
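A guard like the following keeps debug windows out of CI. This is a sketch of the pattern rather than the tool's actual code; `VISUAL_SONAR_HEADLESS` is a hypothetical environment variable, not a documented flag:

```python
import os

def show_debug(window_name, image):
    """Display a debug frame only when a display is actually available.

    cv2.imshow() aborts in GitHub Actions or Docker because no display
    server exists, so the call must be skipped entirely in those runs.
    Returns True if the frame was shown.
    """
    if os.environ.get("VISUAL_SONAR_HEADLESS") == "1":
        return False                        # explicit opt-out for CI
    if os.name != "nt" and not os.environ.get("DISPLAY"):
        return False                        # no X server: stay headless
    import cv2                              # lazy import; needs a GUI build
    cv2.imshow(window_name, image)
    cv2.waitKey(1)
    return True
```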

6. Unicode and Special Characters

  • pyautogui.write() doesn't support non-ASCII characters
  • User credentials often contain special characters (ñ, ü, 中文)
  • International locales require full Unicode support
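The usual workaround is to route non-ASCII text through the clipboard and paste it with Ctrl+V. A minimal sketch, assuming Pyperclip handles the clipboard staging; `needs_clipboard` and `type_text` are illustrative names, not the tool's documented API:

```python
def needs_clipboard(text):
    """True when the text contains characters pyautogui.write() can't type."""
    return not text.isascii()

def type_text(text):
    # Lazy imports: both libraries require an interactive desktop session
    import pyautogui
    import pyperclip
    if needs_clipboard(text):
        pyperclip.copy(text)                 # stage Unicode on the clipboard
        pyautogui.hotkey("ctrl", "v")        # paste into the focused field
    else:
        pyautogui.write(text, interval=0.02) # plain ASCII: direct typing
```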

7. Security and Secrets Management

  • Credentials in input.json must never appear in logs
  • Step screenshots may expose sensitive data
  • No built-in encryption for input files
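Keeping credentials out of logs can be done with a logging filter that masks every secret loaded from input.json before a record is emitted. This is a hypothetical sketch of the scrubbing idea; Visual Sonar's actual implementation may differ:

```python
import logging

class SecretScrubber(logging.Filter):
    """Replace registered secret values with *** in every log message."""

    def __init__(self, secrets):
        super().__init__()
        self.secrets = [s for s in secrets if s]   # ignore empty values

    def filter(self, record):
        msg = record.getMessage()                  # render args first
        for secret in self.secrets:
            msg = msg.replace(secret, "***")
        record.msg, record.args = msg, ()          # freeze scrubbed message
        return True                                # never drop the record
```

Attaching the filter to the root logger scrubs all handlers at once, so a later `logger.info("login with %s", password)` can no longer leak the raw value.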

Constraints

Constraint | Detail
No Server Access | Cannot install agents on the remote VM
Single Developer | Initially developed and maintained by one engineer
Zero Budget | No licensing fees allowed for automation tools
Windows-Only | Remote sessions are Windows VMs via RDP
Python Ecosystem | Must integrate with existing pytest infrastructure

Solution

Approach: "Echolocation for GUIs"

Inspired by how bats navigate using echolocation, Visual Sonar detects form fields by analyzing visual changes when focus shifts:

┌─────────────────────────────────────────────────────────────┐
│  BEFORE Screenshot  →  Press TAB  →  AFTER Screenshot       │
│                                                             │
│  Compare pixels: Where did the focus ring appear?           │
│  That region = the form field coordinates                   │
└─────────────────────────────────────────────────────────────┘
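The mapping loop above can be sketched as follows. `take_screenshot`, `press_tab`, and `detect_focus_region` are injected callables (hypothetical names) so the same loop can drive the real GUI via PyAutoGUI or a simulator; it stops when the focus wraps back to a field it has already seen:

```python
def map_fields(take_screenshot, press_tab, detect_focus_region, max_fields=20):
    """Walk a form with TAB, recording each focus-ring region in order."""
    regions = []
    before = take_screenshot()
    for _ in range(max_fields):
        press_tab()
        after = take_screenshot()
        region = detect_focus_region(before, after)
        if region is None:          # nothing changed: skip this press
            before = after
            continue
        if region in regions:       # focus wrapped around: form fully mapped
            break
        regions.append(region)
        before = after
    return regions
```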

Key Technical Implementations

Computer Vision Detection Algorithm

import cv2
import numpy as np

def get_focus_region(img_before, img_after):
    # Grayscale conversion (screenshots arrive as RGB arrays)
    gray_a = cv2.cvtColor(img_before, cv2.COLOR_RGB2GRAY)
    gray_b = cv2.cvtColor(img_after, cv2.COLOR_RGB2GRAY)
    
    # Gaussian blur to reduce RDP compression noise
    gray_a = cv2.GaussianBlur(gray_a, (5, 5), 0)
    gray_b = cv2.GaussianBlur(gray_b, (5, 5), 0)
    
    # Absolute difference + dynamic thresholding above the noise floor
    diff = cv2.absdiff(gray_a, gray_b)
    median_val = np.median(diff)
    thresh_val = int(median_val + 15)
    _, thresh = cv2.threshold(diff, thresh_val, 255, cv2.THRESH_BINARY)
    
    # Morphological dilation to connect fragmented focus rings
    kernel = np.ones((5, 5), np.uint8)
    dilated = cv2.dilate(thresh, kernel, iterations=2)
    
    # Largest changed contour = the focused field
    contours, _ = cv2.findContours(dilated, cv2.RETR_EXTERNAL, 
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # no visible change: focus did not move
    largest = max(contours, key=cv2.contourArea)
    
    return cv2.boundingRect(largest)

Per-Monitor DPI Detection

import ctypes
import pygetwindow as gw

def _detect_dpi_scale(self):
    try:
        # Per-window DPI: the RDP window may sit on a monitor scaled
        # differently from the primary display
        rdp_windows = gw.getWindowsWithTitle('Remote Desktop')
        hwnd = rdp_windows[0]._hWnd
        dpi = ctypes.windll.user32.GetDpiForWindow(hwnd)
        return dpi / 96.0
    except (IndexError, AttributeError, OSError):
        # Fall back to the system-wide DPI if the window can't be found
        return ctypes.windll.user32.GetDpiForSystem() / 96.0

Tools and Frameworks Used

Component | Tool | Purpose
Computer Vision | OpenCV | Screen diff, contour detection
Input Simulation | PyAutoGUI | Mouse/keyboard control
Clipboard | Pyperclip | Unicode text handling
Window Detection | PyGetWindow | DPI-aware window targeting
OCR (Optional) | EasyOCR / Pytesseract | Text extraction for assertions

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│                        VISUAL SONAR ARCHITECTURE                    │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────┐       │
│   │  input.json  │     │ wvd_map.json │     │  users.csv   │       │
│   │  (secrets)   │     │ (coordinates)│     │ (batch data) │       │
│   └──────┬───────┘     └──────┬───────┘     └──────┬───────┘       │
│          │                    │                    │                │
│          └────────────────────┼────────────────────┘                │
│                               ▼                                     │
│   ┌──────────────────────────────────────────────────────────────┐  │
│   │                    VISUAL SONAR ENGINE                       │  │
│   │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │  │
│   │  │ DPI Detect  │  │ Stabilizer  │  │ CV Focus Detector   │  │  │
│   │  │ (per-window)│  │ (3-frame)   │  │ (diff + contours)   │  │  │
│   │  └─────────────┘  └─────────────┘  └─────────────────────┘  │  │
│   └──────────────────────────────────────────────────────────────┘  │
│                               │                                     │
│                               ▼                                     │
│   ┌──────────────────────────────────────────────────────────────┐  │
│   │                    WVD / CITRIX WINDOW                       │  │
│   │                  (Remote Desktop Session)                    │  │
│   └──────────────────────────────────────────────────────────────┘  │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

Outcome / Impact

Quantified Improvements

Metric | Before | After | Improvement
Manual Testing Time | 4-6 hours/day | 30-45 min/day | ~85% reduction
Test Cases per Cycle | 5-10 (manual limit) | 50+ (batch) | 5-10x throughput
Coordinate Breakage | 3-5 incidents/week | 0 (DPI-aware) | 100% elimination
Onboarding Time | 2-3 days (script walkthrough) | 30 min (JSON review) | ~90% reduction
Test Traceability | None | Full screenshots | 100% coverage
Licensing Cost | $10,000+/year (UiPath) | $0 (MIT License) | 100% savings

Long-Term Benefits

  1. Version-Controlled Test Assets: wvd_map.json files are committed to Git, enabling code review and rollback capability
  2. Self-Documenting Tests: Named fields (username:text, submit:click) replace cryptic coordinates
  3. CI/CD Integration: Headless mode enables GitHub Actions / Jenkins pipelines
  4. Safe Development Environment: Built-in WVD Simulator for testing without remote access
  5. Extensible Architecture: Clean separation enables custom detection algorithms and additional field types
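To illustrate the self-documenting map idea, a wvd_map.json might look like the following. The field names, region format, and exact schema here are illustrative rather than the tool's actual format:

```json
{
  "dpi_scale": 1.5,
  "resolution": [2560, 1440],
  "fields": [
    {"name": "username", "type": "text",  "region": [412, 298, 220, 28]},
    {"name": "password", "type": "text",  "region": [412, 346, 220, 28]},
    {"name": "submit",   "type": "click", "region": [412, 402, 96, 32]}
  ]
}
```

Because the map is plain JSON, a reviewer can see at a glance that `username` is a text field at a named location, rather than decoding a bare (450, 320) in a script.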

Summary

Visual Sonar addresses the critical automation gap for WVD/Citrix remote desktop environments where traditional DOM-based tools cannot operate. By applying computer vision to detect form field focus changes, the solution provides reliable, DPI-aware, and data-driven test automation without licensing fees or server-side agents. The tool reduces manual testing effort by ~85%, eliminates coordinate breakage incidents, and provides full test traceability through automated screenshots.


Project Repository: github.com/godhiraj-code/wvdautomation
PyPI Package: pypi.org/project/visual-sonar
Documentation: USER_GUIDE.md


Case Study Author: Dhiraj Das
Last Updated: December 2025
