
Evaluating Visual Grounding Models on Accounting Software

Jan 28, 2026
13 min read
Saumil Srivastava



Abstract

This blog evaluates visual grounding models on professional accounting software interfaces. Testing GroundNext-7B and ShowUI-2B across 595 grounding targets from four platforms (Odoo, QuickBooks Online, ERPNext, Dolibarr) reveals a significant performance gap compared to published benchmark results.

GroundNext-7B achieves 13.28% accuracy on accounting UI compared to 50.2% on ScreenSpot-Pro—a 37 percentage point gap. ShowUI-2B achieves 0.15% accuracy on this benchmark. The 89-94% SOTA agreement rate indicates both models produce incorrect predictions on similar samples, suggesting ensemble strategies may have limited benefit.

Semantic aliasing (42.8%) emerges as the dominant failure mode, followed by icon scale (33.7%) and coordinate regression (23.4%). Platform variance is high: Odoo's modern OWL framework achieves 45.98% while ERPNext achieves only 3.70%, suggesting UI complexity significantly impacts grounding performance.

These findings suggest domain-specific fine-tuning may be beneficial for deploying visual grounding models in vertical enterprise applications.

1. Introduction

Visual grounding, the task of localizing UI elements from natural language descriptions, is foundational to Computer-Use Agents (CUAs). Recent models report impressive benchmark performance: ShowUI-2B achieves 7.7% on ScreenSpot-Pro, while GroundNext-7B (trained on 700K instruction pairs from the GroundCUA dataset) achieves 50.2% on the same benchmark [1].

The implicit assumption is that benchmark accuracy predicts accuracy on real applications. This study tests that assumption on accounting software.

The evaluation covers SOTA grounding models on professional accounting software, a vertical enterprise domain characterized by dense information displays, small interactive elements, and repetitive form structures. The results show that models achieving 50-75% on standard benchmarks reach only 0-13% on accounting interfaces.

1.1 Contributions

  1. Accounting-specific grounding benchmark spanning 4 platforms, 65 screenshots, and 595 validated targets
  2. Quantified domain gap of 37 percentage points (GroundNext-7B) between published and real-world accuracy
  3. Failure taxonomy identifying semantic aliasing, icon scale, and coordinate regression as primary failure modes
  4. Platform variance analysis revealing a 42pp accuracy range (3.7%-46%) correlated with UI framework modernity

2. Related Work

2.1 Visual Grounding Models

GroundNext-7B [1] extends Qwen2.5-VL-7B with training on the GroundCUA dataset (700K instruction pairs from 56K desktop screenshots). It achieves 50.2% (SFT) / 52.9% (RL) on ScreenSpot-Pro.

ShowUI-2B [2] introduces UI-guided token selection for efficient grounding, achieving 7.7% on ScreenSpot-Pro with only 2B parameters (75.1% on the original ScreenSpot).

UGround-V1-7B [3] trains on 10M UI elements from 1.3M screenshots, achieving 31.1% on ScreenSpot-Pro.

2.2 GUI Benchmarks

Existing benchmarks primarily evaluate consumer applications:

  • ScreenSpot [4]: Web and mobile interfaces
  • ScreenSpot-Pro [5]: Professional software subset
  • Mind2Web [6]: Web navigation tasks

No benchmark specifically targets enterprise ERP/accounting software, despite its prevalence in business automation use cases.

3. Benchmark: accounting_grounding_v1

3.1 Platform Coverage

| Platform | Framework | Screenshots | Targets | Description |
|---|---|---|---|---|
| Odoo | OWL | 24 | 87* | Open-source ERP |
| QuickBooks Online | React | 18 | 193 | Commercial SaaS |
| ERPNext | Frappe | 10 | 81 | Open-source ERP |
| Dolibarr | PHP | 13 | 234 | Open-source ERP |
| **Total** | - | **65** | **595** | - |

*77 Odoo samples excluded due to data quality issues (see Section 3.4)

3.2 Target Characteristics

Task Categories:

  • Navigation (sidebar, menus, tabs)
  • List Navigation (table rows, pagination, filters)
  • Data Entry (form fields, inputs, dropdowns)
  • Authentication (login flows)

Element Sizes:

  • Small: <0.1% viewport (toolbar icons, action buttons)
  • Medium: 0.1-1% viewport (form fields, menu items)
  • Large: >1% viewport (main content areas)

Difficulty Levels:

  • Easy: Large, clearly labeled elements
  • Medium: Standard elements with common labels
  • Hard: Small icons, ambiguous labels, dense regions

3.3 Annotation Format

Each target includes:

  • Pixel-precise bounding box coordinates
  • Multiple instruction variants (generic, specific, verbose)
  • Element metadata (type, size, area ratio)
  • Ground truth center coordinates
  • Difficulty classification
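For illustration, a single target record might look like the following. This is a hypothetical sketch; the field names are assumptions based on the list above, not the exact dataset schema.

```python
# Hypothetical target record; field names are illustrative,
# not the exact accounting_grounding_v1 schema.
target = {
    "platform": "odoo",
    "screenshot": "odoo/nav_02.png",           # assumed path layout
    "bbox": [24, 12, 41, 29],                  # pixel-precise [x1, y1, x2, y2]
    "center": [32, 20],                        # ground truth center (x, y)
    "instructions": {
        "generic": "Click the apps menu",
        "specific": "Click the grid icon in the top-left corner",
        "verbose": "Open the application switcher via the 3x3 grid icon at top left",
    },
    "element": {"type": "icon", "size": "small", "area_ratio": 0.0001},
    "difficulty": "hard",
}
```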

3.4 Data Quality

77 Odoo samples from `form_01`, `nav_01`, and `vendor_01` directories were excluded due to missing screenshots during evaluation. This affected Odoo's sample count (87 valid of 164 total) but did not impact QBO, ERPNext, or Dolibarr data.

4. Experimental Setup

4.1 Models Evaluated

GroundNext-7B (ServiceNow/GroundNext-7B-V0)

  • Architecture: Qwen2.5-VL-7B (8B parameters)
  • Output: Pixel coordinates via tool call JSON format
  • Published: 50.2% ScreenSpot-Pro (SFT)

ShowUI-2B (showlab/ShowUI-2B)

  • Architecture: Qwen2-VL-2B with UI-guided token selection
  • Output: Normalized coordinates [0, 1]
  • Published: 7.7% ScreenSpot-Pro
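The two models emit coordinates in different spaces: GroundNext-7B in pixels, ShowUI-2B normalized to [0, 1]. Before scoring, predictions are mapped to a common pixel space. A minimal sketch (the helper name is mine, not from either model's codebase):

```python
def to_pixels(pred_x: float, pred_y: float, width: int, height: int,
              normalized: bool) -> tuple[int, int]:
    """Map a model prediction to pixel space.

    GroundNext-7B already emits pixel coordinates; ShowUI-2B emits
    coordinates in [0, 1] that must be scaled by screenshot dimensions.
    """
    if normalized:
        return round(pred_x * width), round(pred_y * height)
    return round(pred_x), round(pred_y)

# A ShowUI-style prediction of (0.47, 0.02) on a 1920x1080 screenshot:
# to_pixels(0.47, 0.02, 1920, 1080, normalized=True) -> (902, 22)
```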

4.2 Evaluation Protocol

Metric: Point-in-box accuracy. A prediction is correct if the predicted point (x, y) falls within the ground truth bounding box.
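In code, the metric reduces to a containment check. A minimal sketch, assuming [x1, y1, x2, y2] boxes:

```python
def point_in_box(x: float, y: float,
                 bbox: tuple[float, float, float, float]) -> bool:
    """True if the predicted point lands inside the ground truth box."""
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def accuracy(predictions, targets) -> float:
    """Fraction of (x, y) predictions that fall inside their target's bbox."""
    hits = sum(point_in_box(x, y, t["bbox"])
               for (x, y), t in zip(predictions, targets))
    return hits / len(targets)
```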

Infrastructure: Modal Labs serverless GPU (NVIDIA A10G, 24GB VRAM)

Inference Time:

  • GroundNext-7B: ~19s per sample (~4 hours total)
  • ShowUI-2B: ~2.2s per sample (~25 min total)

4.3 Batch Processing

Due to inference latency, batch processing was implemented:

  • Batch size: 200 samples
  • Checkpoint saves after each batch
  • Resume capability on timeout
  • Total runtime: ~4.5 hours
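A minimal sketch of the checkpoint-and-resume loop (the file path and function names are illustrative, not the actual harness):

```python
import json
from pathlib import Path

BATCH_SIZE = 200
CHECKPOINT = Path("results_checkpoint.json")

def run_with_checkpoints(samples: list, predict) -> list:
    """Evaluate in batches, saving after each batch so a timeout can resume.

    `predict` is assumed to return a JSON-serializable result per sample.
    """
    results = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else []
    for start in range(len(results), len(samples), BATCH_SIZE):
        batch = samples[start:start + BATCH_SIZE]
        results.extend(predict(s) for s in batch)
        CHECKPOINT.write_text(json.dumps(results))  # resume point on timeout
    return results
```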

5. Results

5.1 Overall Accuracy

| Model | Accuracy | Samples | Mean Latency |
|---|---|---|---|
| GroundNext-7B | **13.28%** | 595 | 19,060ms |
| ShowUI-2B | **0.15%** | 684 | 2,200ms |

5.2 Domain Gap Quantification

| Model | ScreenSpot-Pro | Accounting UI | Gap |
|---|---|---|---|
| GroundNext-7B | 50.2% | **13.28%** | **-36.9pp** |
| ShowUI-2B | 7.7% | 0.15% | -7.55pp |

This suggests models trained on general UI do not transfer well to accounting interfaces without additional fine-tuning.

5.3 Cross-Platform Performance

| Platform | Accuracy | Samples | Successes | Framework |
|---|---|---|---|---|
| Odoo | **45.98%** | 87 | 40 | OWL (Modern) |
| QBO | 9.84% | 193 | 19 | React |
| ERPNext | 3.70% | 81 | 3 | Frappe |
| Dolibarr | 7.26% | 234 | 17 | PHP (Legacy) |

Key Finding: Platform variance spans 42.3 percentage points. Odoo's modern OWL framework achieves 12x higher accuracy than ERPNext, suggesting UI framework modernity significantly impacts grounding performance.

5.4 SOTA Agreement Analysis

| Agreement Type | Percentage |
|---|---|
| Both Fail | 89-94% |
| Both Succeed | 0-6% |
| GroundNext Only | 6-11% |
| ShowUI Only | 0% |

The 89-94% agreement rate indicates models produce incorrect predictions on similar samples. This suggests ensemble strategies may have limited benefit when failure modes overlap.
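For reference, the agreement buckets above can be tallied from per-sample correctness flags. A sketch assuming two aligned boolean lists, one per model:

```python
from collections import Counter

def agreement(gn_correct: list[bool], su_correct: list[bool]) -> Counter:
    """Bucket each sample by which model(s) predicted it correctly."""
    buckets = Counter()
    for gn, su in zip(gn_correct, su_correct):
        if gn and su:
            buckets["both_succeed"] += 1
        elif gn:
            buckets["groundnext_only"] += 1
        elif su:
            buckets["showui_only"] += 1
        else:
            buckets["both_fail"] += 1
    return buckets
```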

5.5 Performance by Element Size

| Size | Area Ratio | GroundNext-7B | ShowUI-2B |
|---|---|---|---|
| Large | >1% | **23.08%** | 0% |
| Medium | 0.1-1% | 9.68% | 0% |
| Small | <0.1% | 8.93% | 0% |

Even large elements achieve only 23% accuracy—far below the 50%+ expected from benchmark performance.

5.6 Performance by Task Category

| Category | GroundNext-7B | ShowUI-2B |
|---|---|---|
| Navigation | **23.08%** | 0% |
| List Navigation | 0% | 0% |
| Data Entry | 0% | 0% |

List navigation and data entry—core accounting workflows—show 0% accuracy across both models.

6. Failure Analysis

6.0 Visual Examples of Failure Cases

The following figures show example predictions on accounting UI. Green circles indicate ground truth locations; red X marks indicate model predictions.


Figure 1: Semantic aliasing - "Click Confirm Order" (1006px error). The model predicted (1082, 125) instead of the ground truth (76, 129); the prediction landed on the "Purchase Order" button on the right instead of the target action button on the left.


Figure 2: Semantic aliasing - "Click the apps menu" (882px error). The model predicted (906, 19) instead of the ground truth (24, 23); the target was a small grid/apps icon (17x17px) in the top-left corner.

These examples illustrate the gap between benchmark performance and performance on enterprise software.

6.1 Failure Taxonomy

| Failure Type | Count | Percentage | Description |
|---|---|---|---|
| **Semantic Aliasing** | 221 | **42.8%** | Confusion between similar labels |
| Icon Scale | 174 | 33.7% | Small elements not detected |
| Coordinate Regression | 121 | 23.4% | Correct element, imprecise coordinates |

6.2 Platform-Specific Failure Patterns

| Platform | Semantic Aliasing | Icon Scale | Coord. Regression |
|---|---|---|---|
| Odoo | 70.2% | 2.1% | 27.7% |
| QBO | 58.6% | 25.3% | 16.1% |
| ERPNext | 46.2% | 9.0% | 44.9% |
| Dolibarr | 23.0% | **56.2%** | 20.7% |

Dolibarr's legacy PHP interface shows icon scale dominance (56.2%), while QBO's React interface shows semantic aliasing dominance (58.6%).

6.3 Semantic Aliasing Examples

| Instruction | Prediction | Ground Truth | Error Type |
|---|---|---|---|
| "Click Account num header" | Parent account header | Account num header | Wrong column |
| "Click Save button" | Submit button | Save button | Similar label |
| "Click account 101" | Account 100 row | Account 101 row | Adjacent row |

Accounting forms contain repetitive, visually similar elements that may be challenging for models trained primarily on consumer UI.

6.4 Icon Scale Analysis

| Element Type | Typical Size | Viewport % | Performance |
|---|---|---|---|
| Menu bar icons | 35x35px | 0.06% | Low |
| Toolbar buttons | 25x25px | 0.03% | Low |
| Action icons | 20x20px | 0.02% | Low |
| Grid/hamburger | 17x17px | 0.01% | Low |

Accounting UI elements are 10-80x smaller than mobile/web design standards (iOS: 44x44pt minimum, Material: 36x36dp minimum).
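The viewport percentages above follow directly from element area over screen area at 1920x1080; the 17x17px grid icon, for instance, covers 289 / 2,073,600 ≈ 0.014% of the viewport. A one-function sketch:

```python
def viewport_pct(elem_w: int, elem_h: int,
                 screen_w: int = 1920, screen_h: int = 1080) -> float:
    """Element area as a percentage of the viewport."""
    return 100.0 * (elem_w * elem_h) / (screen_w * screen_h)

# viewport_pct(17, 17) -> ~0.014%   (grid/hamburger icon)
# viewport_pct(35, 35) -> ~0.059%   (menu bar icon)
```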

7. Discussion

7.1 Why the Domain Gap Exists

Training Distribution Mismatch:

  • ScreenSpot-Pro: Modern web apps, mobile interfaces, clean visual hierarchy
  • Accounting UI: Dense ERP interfaces, tiny toolbar icons, repetitive form fields

Element Density: A typical accounting form contains 20+ interactive elements in a 1920x1080 viewport, many sharing identical visual styling. This density may exceed what current training datasets represent.

Domain Vocabulary: Instructions like "Click the Due Date field" use accounting terminology that may be underrepresented in general GUI training data. Disambiguating among multiple date-related fields on an invoice form may require domain-specific training.

7.2 Ensemble Considerations

High SOTA agreement (89-94%) suggests ensemble approaches may have limited benefit. Both models fail on similar samples because:

  1. Shared training distributions lack ERP exposure
  2. Icon scale failures affect all VLM-based grounders equally
  3. Semantic aliasing requires domain-specific disambiguation training

7.3 Platform Variance Implications

The 42pp accuracy range across platforms suggests:

  1. Modern UI frameworks (OWL, React) provide cleaner grounding signals
  2. Legacy interfaces (PHP) with inconsistent styling are harder to ground
  3. Fine-tuning should prioritize lowest-performing platforms (ERPNext, Dolibarr)

8. Recommendations

8.1 For CUA Practitioners

  1. Validate on the target domain before deployment. ScreenSpot-Pro performance may not predict accounting UI performance; run real GPU inference on your specific application.
  2. Expect substantial degradation (37pp for GroundNext-7B here) when moving from consumer to enterprise applications.
  3. Consider domain-specific fine-tuning for accounting automation use cases.

8.2 For Fine-Tuning Strategy

| Failure Mode | % | Training Data Focus |
|---|---|---|
| Semantic Aliasing | 42.8% | Similar labels, form field disambiguation |
| Icon Scale | 33.7% | Small icons, toolbar buttons, dense menus |
| Coordinate Regression | 23.4% | Near-miss cases, bounding box precision |

Collect 500+ training pairs balanced across platforms and failure modes. Prioritize hard examples from ERPNext (3.7%) and Dolibarr (7.3%).

9. Limitations & Discussion

9.1 ShowUI's Low Accuracy

ShowUI-2B achieved 0.15% (1/684 correct predictions). Potential evaluation bugs were investigated:

  • Output format verified: ShowUI outputs normalized [0,1] coordinates. Correct denormalization to pixel space was confirmed.
  • Inference validated: Manual inspection of predictions showed the model produces valid coordinate outputs, but they consistently land on incorrect elements.
  • Possible explanation: ShowUI was trained primarily on web/mobile UI. Accounting software's dense, small-element layouts may fall outside its effective training distribution. Others are welcome to reproduce and verify.

9.2 Sample Size

595 valid targets across 65 screenshots is modest. Larger benchmarks would strengthen conclusions. However:

  • The consistency of results across 4 independent platforms suggests the pattern is real
  • The 37pp gap is large enough that sampling variance is unlikely to explain it
  • The full dataset is released for others to extend

9.3 Odoo Data Quality

77 Odoo samples were excluded due to missing screenshots during Modal evaluation (file sync issue). This reduced Odoo from 164 to 87 samples. The 45.98% accuracy is based on valid samples only. Key points:

  • Odoo's higher accuracy is consistent with its modern OWL framework
  • The excluded samples were random (infrastructure failure, not selection bias)
  • Other platforms (QBO, ERPNext, Dolibarr) were unaffected

9.4 Distribution Shift vs. Benchmark Overfitting

One interpretation is that the models are overfit to ScreenSpot's distribution, not that accounting UI is inherently "harder." This is better framed as distribution shift: the training data for these models underrepresents enterprise software. That is expected, given training datasets are scraped largely from consumer web and mobile apps. The finding is still actionable: if deploying to enterprise software, evaluate on enterprise data.

9.5 Reproducibility

  • Dataset: huggingface.co/datasets/s4um1l/accounting_grounding_v1
  • Evaluation: Modal A10G GPU, standard inference
  • Metric: Point-in-box accuracy (prediction inside ground truth bounding box)
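As a starting point, the benchmark can be pulled from the Hugging Face Hub. A sketch assuming the standard `datasets` loading path (the split name and record layout may differ in the actual repository):

```python
from datasets import load_dataset

# "train" split is an assumption; check the dataset card for actual splits.
ds = load_dataset("s4um1l/accounting_grounding_v1", split="train")
print(ds[0])  # inspect one target record
```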

10. Conclusion

This study evaluated visual grounding models on professional accounting software. The results show a performance gap: GroundNext-7B drops from 50.2% (ScreenSpot-Pro) to 13.28% (accounting UI), while ShowUI-2B drops from 7.7% to 0.15%.

Summary of Findings:

| Metric | Value |
|---|---|
| Best Model Accuracy | 13.28% (GroundNext-7B) |
| Worst Model Accuracy | 0.15% (ShowUI-2B) |
| Domain Gap | 37-75pp |
| SOTA Agreement | 89-94% |
| Primary Failure Mode | Semantic Aliasing (42.8%) |
| Platform Range | 3.70% - 45.98% |

Current benchmarks primarily cover consumer applications.

Key Insight: The 42pp platform variance (Odoo 46% vs ERPNext 3.7%) suggests UI framework modernity significantly impacts grounding performance, providing a roadmap for prioritizing fine-tuning data collection.

Takeaway: Domain-specific evaluation and fine-tuning may be necessary for vertical enterprise applications. Benchmark performance on consumer applications may not generalize to enterprise software.

References

[1] ServiceNow Research. "GroundCUA: Grounded GUI Agents via Reinforcement Learning." arXiv:2511.07332, 2025.

[2] ShowLab. "ShowUI: One Vision-Language-Action Model for GUI Visual Agent." arXiv:2411.17465, 2024.

[3] OSU NLP. "UGround: Universal Visual Grounding for GUI Agents." ICLR 2025 Oral.

[4] Cheng et al. "ScreenSpot: A Large-scale Benchmark for Visual Grounding." 2024.

[5] Li et al. "ScreenSpot-Pro: Professional Software Benchmark for Visual Grounding." 2024.

[6] Deng et al. "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023.

Appendix A: Benchmark Statistics

| Metric | Value |
|---|---|
| Total Screenshots | 65 |
| Total Targets | 684 |
| Valid Targets | 595 |
| Platforms | 4 |
| Task Categories | 6 |
| Resolution | 1920x1080 |
| Evaluation Infrastructure | Modal A10G GPU |
| Total Compute Time | ~5 hours |

Appendix B: Reproducibility

Benchmark Dataset: huggingface.co/datasets/s4um1l/accounting_grounding_v1

```
accounting_grounding_v1/
  odoo/       # 24 screenshots
  qbo/        # 18 screenshots
  erpnext/    # 10 screenshots
  dolibarr/   # 13 screenshots
```

---

Benchmark version: accounting_grounding_v1
Models: GroundNext-7B (ServiceNow), ShowUI-2B (ShowLab)
Infrastructure: Modal Labs A10G GPU
Total samples: 595 valid / 684 total
