Abstract
This blog evaluates visual grounding models on professional accounting software interfaces. Testing GroundNext-7B and ShowUI-2B across 595 grounding targets from four platforms (Odoo, QuickBooks Online, ERPNext, Dolibarr) reveals a significant performance gap relative to published benchmark results.
GroundNext-7B achieves 13.28% accuracy on accounting UI compared to 50.2% on ScreenSpot-Pro—a 37 percentage point gap. ShowUI-2B achieves 0.15% accuracy on this benchmark. The 89-94% SOTA agreement rate indicates both models produce incorrect predictions on similar samples, suggesting ensemble strategies may have limited benefit.
Semantic aliasing (42.8%) emerges as the dominant failure mode, followed by icon scale (33.7%) and coordinate regression (23.4%). Platform variance is high: Odoo's modern OWL framework achieves 45.98% while ERPNext achieves only 3.70%, suggesting UI complexity significantly impacts grounding performance.
These findings suggest domain-specific fine-tuning may be beneficial for deploying visual grounding models in vertical enterprise applications.
1. Introduction
Visual grounding, the task of localizing UI elements from natural language descriptions, is foundational to Computer-Use Agents (CUAs). Recent models are typically compared on shared benchmarks: ShowUI-2B achieves 7.7% on ScreenSpot-Pro, while GroundNext-7B (trained on 700K instruction pairs from the GroundCUA dataset) achieves 50.2% on the same benchmark [1].
An implicit assumption is that such benchmark numbers predict real-world performance; this study tests that assumption on accounting software.
The evaluation covers SOTA grounding models on professional accounting software, a vertical enterprise domain characterized by dense information displays, small interactive elements, and repetitive form structures. The results show that models achieving 50-75% on standard benchmarks reach only 0-13% on accounting interfaces.
1.1 Contributions
- Accounting-specific grounding benchmark spanning 4 platforms, 65 screenshots, and 595 validated targets
- Quantified domain gap of 37 percentage points (GroundNext-7B) between published and real-world accuracy
- Failure taxonomy identifying semantic aliasing, icon scale, and coordinate regression as primary failure modes
- Platform variance analysis revealing 42pp accuracy range (3.7%-46%) correlated with UI framework modernity
2. Related Work
2.1 Visual Grounding Models
GroundNext-7B [1] extends Qwen2.5-VL-7B with training on the GroundCUA dataset (700K instruction pairs from 56K desktop screenshots). It achieves 50.2% (SFT) / 52.9% (RL) on ScreenSpot-Pro.
ShowUI-2B [2] introduces UI-guided token selection for efficient grounding, achieving 7.7% on ScreenSpot-Pro with only 2B parameters (75.1% on the original ScreenSpot).
UGround-V1-7B [3] trains on 10M UI elements from 1.3M screenshots, achieving 31.1% on ScreenSpot-Pro.
2.2 GUI Benchmarks
Existing benchmarks primarily evaluate consumer applications:
- ScreenSpot [4]: Web and mobile interfaces
- ScreenSpot-Pro [5]: Professional software subset
- Mind2Web [6]: Web navigation tasks

No benchmark specifically targets enterprise ERP/accounting software, despite its prevalence in business automation use cases.
3. Benchmark: accounting_grounding_v1
3.1 Platform Coverage
| Platform | Framework | Screenshots | Targets | Description |
|---|---|---|---|---|
| Odoo | OWL | 24 | 87* | Open-source ERP |
| QuickBooks Online | React | 18 | 193 | Commercial SaaS |
| ERPNext | Frappe | 10 | 81 | Open-source ERP |
| Dolibarr | PHP | 13 | 234 | Open-source ERP |
| **Total** | - | **65** | **595** | - |
*77 Odoo samples excluded due to data quality issues (see Section 3.4)
3.2 Target Characteristics
Task Categories:
- Navigation (sidebar, menus, tabs)
- List Navigation (table rows, pagination, filters)
- Data Entry (form fields, inputs, dropdowns)
- Authentication (login flows)

Element Sizes (a classification sketch follows these lists):
- Small: <0.1% viewport (toolbar icons, action buttons)
- Medium: 0.1-1% viewport (form fields, menu items)
- Large: >1% viewport (main content areas)

Difficulty Levels:
- Easy: Large, clearly labeled elements
- Medium: Standard elements with common labels
- Hard: Small icons, ambiguous labels, dense regions
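As referenced above, here is a minimal sketch of how a target could be bucketed into the size bands by viewport area ratio. The thresholds mirror the lists above, but the function name and defaults are illustrative rather than the benchmark's actual tooling.

```python
def classify_element_size(box_w: float, box_h: float,
                          viewport_w: int = 1920, viewport_h: int = 1080) -> str:
    """Bucket a target by the fraction of the viewport its bbox covers:
    small < 0.1%, medium 0.1-1%, large > 1% (the bands listed above)."""
    area_ratio = (box_w * box_h) / (viewport_w * viewport_h)
    if area_ratio < 0.001:
        return "small"
    if area_ratio <= 0.01:
        return "medium"
    return "large"

# A 17x17px apps-grid icon covers ~0.014% of a 1920x1080 viewport
print(classify_element_size(17, 17))  # -> "small"
```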
3.3 Annotation Format
Each target includes (an illustrative record follows the list):
- Pixel-precise bounding box coordinates
- Multiple instruction variants (generic, specific, verbose)
- Element metadata (type, size, area ratio)
- Ground truth center coordinates
- Difficulty classification
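For illustration, a single record might look like the sketch below. The field names mirror the bullets above but are assumptions; the released dataset's exact schema may differ.

```python
# Illustrative annotation record; names and values are hypothetical.
example_target = {
    "screenshot": "qbo/invoice_form.png",  # hypothetical path
    "bbox": [412, 318, 598, 346],          # x1, y1, x2, y2 in pixels
    "center": [505, 332],                  # ground-truth click point
    "instructions": {
        "generic": "Click the date field",
        "specific": "Click the Due Date field",
        "verbose": "Click the Due Date input in the invoice header form",
    },
    "element_type": "form_field",
    "area_ratio": 0.0025,                  # bbox area / viewport area
    "difficulty": "medium",
}
```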
3.4 Data Quality
77 Odoo samples from `form_01`, `nav_01`, and `vendor_01` directories were excluded due to missing screenshots during evaluation. This affected Odoo's sample count (87 valid of 164 total) but did not impact QBO, ERPNext, or Dolibarr data.
4. Experimental Setup
4.1 Models Evaluated
GroundNext-7B (ServiceNow/GroundNext-7B-V0)
- Architecture: Qwen2.5-VL-7B (8B parameters)
- Output: Pixel coordinates via tool call JSON format
- Published: 50.2% ScreenSpot-Pro (SFT)
ShowUI-2B (showlab/ShowUI-2B)
- Architecture: Qwen2-VL-2B with UI-guided token selection
- Output: Normalized coordinates [0, 1]
- Published: 7.7% ScreenSpot-Pro
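Because the two models emit coordinates in different spaces, both outputs must be mapped to pixels before scoring. A hedged sketch, assuming a tool-call JSON payload for GroundNext-7B and a normalized pair for ShowUI-2B; the exact serializations shown are assumptions for illustration:

```python
import json

def to_pixels(model: str, raw_output: str,
              img_w: int = 1920, img_h: int = 1080) -> tuple[float, float]:
    """Map a model's raw coordinate output to pixel space."""
    if model == "groundnext":
        # Assumed shape: '{"name": "click", "arguments": {"x": 505, "y": 332}}'
        args = json.loads(raw_output)["arguments"]
        return float(args["x"]), float(args["y"])
    if model == "showui":
        # Assumed shape: '[0.263, 0.307]' -- scale [0, 1] to the screenshot size
        x_norm, y_norm = json.loads(raw_output)
        return x_norm * img_w, y_norm * img_h
    raise ValueError(f"unknown model: {model}")
```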
4.2 Evaluation Protocol
Metric: Point-in-box accuracy. A prediction is correct if the predicted (x, y) point falls within the ground-truth bounding box.
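The check itself is a few lines; the sketch below restates the rule above and is not the actual evaluation harness:

```python
def point_in_box(pred_x: float, pred_y: float,
                 bbox: tuple[float, float, float, float]) -> bool:
    """Point-in-box accuracy: a prediction counts as correct iff the
    predicted (x, y) lands inside the ground-truth bounding box."""
    x1, y1, x2, y2 = bbox
    return x1 <= pred_x <= x2 and y1 <= pred_y <= y2

# Example: a click at (505, 332) inside a field spanning (412, 318)-(598, 346)
assert point_in_box(505, 332, (412, 318, 598, 346))
```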
Infrastructure: Modal Labs serverless GPU (NVIDIA A10G, 24GB VRAM)
Inference Time:
- GroundNext-7B: ~19s per sample (~4 hours total)
- ShowUI-2B: ~2.2s per sample (~25 min total)
4.3 Batch Processing
Due to inference latency, batch processing was implemented (a sketch follows the list):
- Batch size: 200 samples
- Checkpoint saves after each batch
- Resume capability on timeout
- Total runtime: ~4.5 hours
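A minimal sketch of such a checkpointed loop, with a hypothetical run_inference helper standing in for the actual Modal model call:

```python
import json
from pathlib import Path

def run_inference(sample: dict) -> dict:
    """Stand-in for one GroundNext/ShowUI call on the Modal A10G (assumed)."""
    raise NotImplementedError

def run_with_checkpoints(samples: list[dict], batch_size: int = 200,
                         ckpt_path: str = "results.jsonl") -> None:
    """Resumable batch loop: results append to a JSONL checkpoint after each
    batch, so a timed-out run resumes at the first unprocessed sample."""
    ckpt = Path(ckpt_path)
    done = sum(1 for _ in ckpt.open()) if ckpt.exists() else 0
    for start in range(done, len(samples), batch_size):
        results = [run_inference(s) for s in samples[start:start + batch_size]]
        with ckpt.open("a") as f:
            for r in results:
                f.write(json.dumps(r) + "\n")
```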
5. Results
5.1 Overall Accuracy
| Model | Accuracy | Samples | Mean Latency |
|---|---|---|---|
| GroundNext-7B | **13.28%** | 595 | 19,060ms |
| ShowUI-2B | **0.15%** | 684 | 2,200ms |
5.2 Domain Gap Quantification
| Model | ScreenSpot-Pro | Accounting UI | Gap |
|---|---|---|---|
| GroundNext-7B | 50.2% | **13.28%** | **-36.9pp** |
| ShowUI-2B | 7.7% | 0.15% | -7.55pp |
This suggests models trained on general UI do not transfer well to accounting interfaces without additional fine-tuning.
5.3 Cross-Platform Performance
| Platform | Accuracy | Samples | Successes | Framework |
|---|---|---|---|---|
| Odoo | **45.98%** | 87 | 40 | OWL (Modern) |
| QBO | 9.84% | 193 | 19 | React |
| ERPNext | 3.70% | 81 | 3 | Frappe |
| Dolibarr | 7.26% | 234 | 17 | PHP (Legacy) |
Key Finding: Platform variance spans 42.3 percentage points. Odoo's modern OWL framework achieves 12x higher accuracy than ERPNext, suggesting UI framework modernity significantly impacts grounding performance.
5.4 SOTA Agreement Analysis
| Agreement Type | Percentage |
|---|---|
| Both Fail | 89-94% |
| Both Succeed | 0-6% |
| GroundNext Only | 6-11% |
| ShowUI Only | 0% |
The 89-94% agreement rate indicates models produce incorrect predictions on similar samples. This suggests ensemble strategies may have limited benefit when failure modes overlap.
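Given per-sample correctness vectors for the two models, the agreement breakdown reduces to simple counting; a sketch with illustrative names:

```python
def agreement_breakdown(correct_a: list[bool], correct_b: list[bool]) -> dict:
    """Fraction of samples where two models both fail, both succeed,
    or only one succeeds."""
    n = len(correct_a)
    pairs = list(zip(correct_a, correct_b))
    return {
        "both_fail":    sum(not a and not b for a, b in pairs) / n,
        "both_succeed": sum(a and b for a, b in pairs) / n,
        "a_only":       sum(a and not b for a, b in pairs) / n,
        "b_only":       sum(b and not a for a, b in pairs) / n,
    }
```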
5.5 Performance by Element Size
| Size | Area Ratio | GroundNext-7B | ShowUI-2B |
|---|---|---|---|
| Large | >1% | **23.08%** | 0% |
| Medium | 0.1-1% | 9.68% | 0% |
| Small | <0.1% | 8.93% | 0% |
Even large elements achieve only 23% accuracy—far below the 50%+ expected from benchmark performance.
5.6 Performance by Task Category
| Category | GroundNext-7B | ShowUI-2B |
|---|---|---|
| Navigation | **23.08%** | 0% |
| List Navigation | 0% | 0% |
| Data Entry | 0% | 0% |
List navigation and data entry—core accounting workflows—show 0% accuracy across both models.
6. Failure Analysis
6.0 Visual Examples of Failure Cases
The following figures show example predictions on accounting UI. Green circles indicate ground truth locations; red X marks indicate model predictions.

Figure 1: Semantic Aliasing - "Click Confirm Order" (1006px error) Model predicted (1082, 125) instead of ground truth (76, 129). The prediction landed on "Purchase Order" button on the right instead of the target action button on the left.

Figure 2: Semantic Aliasing - "Click the apps menu" (882px error) Model predicted (906, 19) instead of ground truth (24, 23). The target was a small grid/apps icon (17x17px) in the top-left corner.
These examples illustrate the gap between benchmark performance and performance on enterprise software.
6.1 Failure Taxonomy
| Failure Type | Count | Percentage | Description |
|---|---|---|---|
| **Semantic Aliasing** | 221 | **42.8%** | Confusion between similar labels |
| Icon Scale | 174 | 33.7% | Small elements not detected |
| Coordinate Regression | 121 | 23.4% | Correct element, imprecise coordinates |
6.2 Platform-Specific Failure Patterns
| Platform | Semantic Aliasing | Icon Scale | Coord. Regression |
|---|---|---|---|
| Odoo | 70.2% | 2.1% | 27.7% |
| QBO | 58.6% | 25.3% | 16.1% |
| ERPNext | 46.2% | 9.0% | 44.9% |
| Dolibarr | 23.0% | **56.2%** | 20.7% |
Dolibarr's legacy PHP interface shows icon scale dominance (56.2%), while QBO's React interface shows semantic aliasing dominance (58.6%).
6.3 Semantic Aliasing Examples
| Instruction | Prediction | Ground Truth | Error Type |
|---|---|---|---|
| "Click Account num header" | Parent account header | Account num header | Wrong column |
| "Click Save button" | Submit button | Save button | Similar label |
| "Click account 101" | Account 100 row | Account 101 row | Adjacent row |
Accounting forms contain repetitive, visually similar elements that may be challenging for models trained primarily on consumer UI.
6.4 Icon Scale Analysis
| Element Type | Typical Size | Viewport % | Performance |
|---|---|---|---|
| Menu bar icons | 35x35px | 0.06% | Low |
| Toolbar buttons | 25x25px | 0.03% | Low |
| Action icons | 20x20px | 0.02% | Low |
| Grid/hamburger | 17x17px | 0.01% | Low |
Accounting UI elements are 10-80x smaller than mobile/web design standards (iOS: 44x44pt minimum, Material: 36x36dp minimum).
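The viewport percentages in the table follow directly from the pixel sizes at 1920x1080; a quick arithmetic check:

```python
# Viewport share of each icon class on a 1920x1080 screen.
viewport = 1920 * 1080
for name, side in [("menu bar icon", 35), ("toolbar button", 25),
                   ("action icon", 20), ("grid/hamburger", 17)]:
    print(f"{name}: {side * side / viewport:.2%}")
# menu bar icon: 0.06%   toolbar button: 0.03%
# action icon: 0.02%     grid/hamburger: 0.01%
```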
7. Discussion
7.1 Why the Domain Gap Exists
Training Distribution Mismatch:
- ScreenSpot-Pro: Modern web apps, mobile interfaces, clean visual hierarchy
- Accounting UI: Dense ERP interfaces, tiny toolbar icons, repetitive form fields
Element Density: A typical accounting form contains 20+ interactive elements in a 1920x1080 viewport, many sharing identical visual styling. This density may exceed what current training datasets represent.
Domain Vocabulary: Instructions like "Click the Due Date field" use accounting terminology that may be underrepresented in general GUI training data. Disambiguating among multiple date-related fields on an invoice form may require domain-specific training.
7.2 Ensemble Considerations
High SOTA agreement (89-94%) suggests ensemble approaches may have limited benefit. Both models fail on similar samples because:
- Shared training distributions lack ERP exposure
- Icon scale failures affect all VLM-based grounders equally
- Semantic aliasing requires domain-specific disambiguation training
7.3 Platform Variance Implications
The 42pp accuracy range across platforms suggests:
- Modern UI frameworks (OWL, React) provide cleaner grounding signals
- Legacy interfaces (PHP) with inconsistent styling are harder to ground
- Fine-tuning should prioritize lowest-performing platforms (ERPNext, Dolibarr)
8. Recommendations
8.1 For CUA Practitioners
- Validate on the target domain before vertical deployment: run real GPU inference on your specific application, since ScreenSpot-Pro performance may not predict accounting UI performance.
- Expect 37pp degradation when moving from consumer to enterprise applications.
- Consider domain-specific fine-tuning for accounting automation use cases.
8.2 For Fine-Tuning Strategy
| Failure Mode | % | Training Data Focus |
|---|---|---|
| Semantic Aliasing | 42.8% | Similar labels, form field disambiguation |
| Icon Scale | 33.7% | Small icons, toolbar buttons, dense menus |
| Coordinate Regression | 23.4% | Near-miss cases, bounding box precision |
Collect 500+ training pairs balanced across platforms and failure modes. Prioritize hard examples from ERPNext (3.7%) and Dolibarr (7.3%).
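One way to build such a balanced set is a stratified draw over (platform, failure mode) buckets; a sketch, assuming each candidate pair carries those two labels:

```python
import random
from collections import defaultdict

def balanced_sample(pool: list[dict], per_bucket: int, seed: int = 0) -> list[dict]:
    """Stratified draw over (platform, failure_mode) buckets so hard
    ERPNext/Dolibarr cases aren't swamped by easier platforms."""
    buckets = defaultdict(list)
    for ex in pool:
        buckets[(ex["platform"], ex["failure_mode"])].append(ex)
    rng = random.Random(seed)
    out: list[dict] = []
    for items in buckets.values():
        out.extend(rng.sample(items, min(per_bucket, len(items))))
    return out
```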
9. Limitations & Discussion
9.1 ShowUI's Low Accuracy
ShowUI-2B achieved 0.15% (1/684 correct predictions). Potential evaluation bugs were investigated:
- Output format verified: ShowUI outputs normalized [0,1] coordinates. Correct denormalization to pixel space was confirmed.
- Inference validated: Manual inspection of predictions showed the model produces valid coordinate outputs, but they consistently land on incorrect elements.
- Possible explanation: ShowUI was trained primarily on web/mobile UI. Accounting software's dense, small-element layouts may fall outside its effective training distribution. Others are welcome to reproduce and verify.
9.2 Sample Size
595 valid targets across 65 screenshots is modest, and larger benchmarks would strengthen the conclusions. However:
- The consistency of results across 4 independent platforms suggests the pattern is real
- The 37pp gap is large enough that sampling variance is unlikely to explain it
- The full dataset is released for others to extend
9.3 Odoo Data Quality
77 Odoo samples were excluded due to missing screenshots during Modal evaluation (file sync issue). This reduced Odoo from 164 to 87 samples. The 45.98% accuracy is based on valid samples only. Key points:
- Odoo's higher accuracy is consistent with its modern OWL framework
- The excluded samples were random (infrastructure failure, not selection bias)
- Other platforms (QBO, ERPNext, Dolibarr) were unaffected
9.4 Distribution Shift vs. Benchmark Overfitting
One interpretation is that the models are overfit to ScreenSpot's distribution, not that accounting UI is inherently "harder." This is better framed as distribution shift: the training data for these models underrepresents enterprise software, which is expected given that training datasets are scraped from consumer web/mobile apps. The finding is still actionable: if deploying to enterprise, evaluate on enterprise data.
9.5 Reproducibility
- Dataset: huggingface.co/datasets/s4um1l/accounting_grounding_v1
- Evaluation: Modal A10G GPU, standard inference
- Metric: Point-in-box accuracy (prediction inside ground truth bounding box)
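If the dataset listed above follows the standard Hugging Face datasets layout, loading it could look like the sketch below; the split name and record shape are assumptions, so consult the dataset card:

```python
from datasets import load_dataset

# Split name "train" is an assumption; check the dataset card.
ds = load_dataset("s4um1l/accounting_grounding_v1", split="train")
print(ds[0])  # one grounding target: screenshot reference, bbox, instruction
```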
10. Conclusion
This study evaluated visual grounding models on professional accounting software. The results show a performance gap: GroundNext-7B drops from 50.2% (ScreenSpot-Pro) to 13.28% (accounting UI), while ShowUI-2B drops from 7.7% to 0.15%.
Summary of Findings:
| Metric | Value |
|---|---|
| Best Model Accuracy | 13.28% (GroundNext-7B) |
| Worst Model Accuracy | 0.15% (ShowUI-2B) |
| Domain Gap | 37pp (GroundNext-7B vs ScreenSpot-Pro) to 75pp (ShowUI-2B vs ScreenSpot) |
| SOTA Agreement | 89-94% |
| Primary Failure Mode | Semantic Aliasing (42.8%) |
| Platform Range | 3.70% - 45.98% |
Current benchmarks primarily cover consumer applications.
Key Insight: The 42pp platform variance (Odoo 46% vs ERPNext 3.7%) suggests UI framework modernity significantly impacts grounding performance, providing a roadmap for prioritizing fine-tuning data collection.
Takeaway: Domain-specific evaluation and fine-tuning may be necessary for vertical enterprise applications. Benchmark performance on consumer applications may not generalize to enterprise software.
References
[1] ServiceNow Research. "GroundCUA: Grounded GUI Agents via Reinforcement Learning." arXiv:2511.07332, 2025.
[2] ShowLab. "ShowUI: One Vision-Language-Action Model for GUI Visual Agent." arXiv:2411.17465, 2024.
[3] OSU NLP. "UGround: Universal Visual Grounding for GUI Agents." ICLR 2025 Oral.
[4] Cheng et al. "ScreenSpot: A Large-scale Benchmark for Visual Grounding." 2024.
[5] Li et al. "ScreenSpot-Pro: Professional Software Benchmark for Visual Grounding." 2024.
[6] Deng et al. "Mind2Web: Towards a Generalist Agent for the Web." NeurIPS 2023.
Appendix A: Benchmark Statistics
| Metric | Value |
|---|---|
| Total Screenshots | 65 |
| Total Targets | 684 |
| Valid Targets | 595 |
| Platforms | 4 |
| Task Categories | 6 |
| Resolution | 1920x1080 |
| Evaluation Infrastructure | Modal A10G GPU |
| Total Compute Time | ~5 hours |
Appendix B: Reproducibility
Benchmark Dataset: huggingface.co/datasets/s4um1l/accounting_grounding_v1
```
accounting_grounding_v1/
  odoo/      # 24 screenshots
  qbo/       # 18 screenshots
  erpnext/   # 10 screenshots
  dolibarr/  # 13 screenshots
```

---
- Benchmark version: accounting_grounding_v1
- Models: GroundNext-7B (ServiceNow), ShowUI-2B (ShowLab)
- Infrastructure: Modal Labs A10G GPU
- Total samples: 595 valid / 684 total
