

Poster

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Perampalli Shravan Nayak · Xiangru Jian · Kevin Qinghong Lin · Juan A. Rodriguez · Montek Kalsi · Nicolas Chapados · M. Özsu · Aishwarya Agrawal · David Vazquez · Christopher Pal · Perouz Taslakian · Spandana Gella · Sai Rajeswar Mudumba

West Exhibition Hall B2-B3 #W-400
[ Project Page ]
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT

Abstract:

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer-use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks—Element Grounding, Layout Grounding, and Action Prediction—with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.
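A common way to score an element-grounding task like the one described above is to check whether a model's predicted click point lands inside the annotated bounding box of the target UI element. The sketch below illustrates that idea; the record fields, function names, and the point-in-box criterion are illustrative assumptions, not the benchmark's actual schema or evaluation code.

```python
# Hypothetical sketch of element-grounding scoring: a prediction counts as a hit
# if the predicted (x, y) click point falls inside the target element's bounding
# box. Field names and the metric are assumptions, not UI-Vision's real schema.
from dataclasses import dataclass

@dataclass
class ElementAnnotation:
    label: str    # UI element label, e.g. "Save button"
    bbox: tuple   # (x_min, y_min, x_max, y_max) in screen pixels

def point_in_bbox(x: float, y: float, ann: ElementAnnotation) -> bool:
    """Return True if the predicted (x, y) point lies inside the element's box."""
    x_min, y_min, x_max, y_max = ann.bbox
    return x_min <= x <= x_max and y_min <= y <= y_max

def grounding_accuracy(predictions, annotations) -> float:
    """Fraction of predicted points that hit their target element's bounding box."""
    hits = sum(point_in_bbox(px, py, ann)
               for (px, py), ann in zip(predictions, annotations))
    return hits / max(len(annotations), 1)

# Example: one correct and one incorrect prediction -> accuracy 0.5
anns = [ElementAnnotation("Save button", (10, 10, 60, 30)),
        ElementAnnotation("File menu", (0, 0, 40, 20))]
preds = [(25, 20), (300, 400)]
print(grounding_accuracy(preds, anns))  # 0.5
```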

Lay Summary:

Desktop graphical user interfaces (GUIs)—like those used for software applications—are central to how we perform daily tasks such as editing documents or managing files. Yet, automating these desktop tasks with artificial intelligence remains difficult, mainly due to challenges in understanding the complex visual information and interactions that users regularly navigate. To address this, we developed UI-Vision, a large-scale dataset that captures detailed interactions with 83 popular desktop software applications. It includes thousands of carefully annotated examples showing how humans interact with these interfaces, such as clicking, dragging, and typing. Our dataset provides benchmarks to assess how well AI models understand and interact with desktop GUIs. Evaluations using UI-Vision reveal significant limitations in existing state-of-the-art AI models, particularly when tasks require understanding professional software tools or performing complex actions like dragging and dropping. By clearly identifying these challenges, UI-Vision helps guide future improvements in AI systems designed to automate and enhance everyday computer use.
