(x, y), key presses, typed text, scrolls — rather than the continuous low-DOF action vector you would emit for a robot arm. RPA-style automation, autonomous browser agents, and the CUA PDF Reader course project all fall here.
This is its own subfield with its own data formats and policy architectures. Classical imitation-learning libraries assume Box/Discrete action spaces and Gymnasium environments, so they cannot ingest screen frames or emit tokenized GUI action streams. The visuomotor robotics stack (Diffusion Policy, ACT, LeRobot) handles high-resolution image observations well, but its policy heads are designed for continuous, low-DOF, fixed-shape action vectors, not the variable-shape mix of discrete action type, screen coordinates, and free text that a GUI agent must produce. The CUA stack below is where to start.
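To make the action-space mismatch concrete, here is a minimal sketch of a GUI action record next to a robot-style action vector. The `GuiAction` class, its field names, and the JSON encoding are illustrative assumptions, not the format used by OpenCUA, AgentNet, or any other project listed here.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical action vocabulary, for illustration only.
KINDS = ("click", "type", "key", "scroll")


@dataclass(frozen=True)
class GuiAction:
    kind: str                    # discrete action type
    x: Optional[int] = None      # pixel coordinates (click only)
    y: Optional[int] = None
    text: Optional[str] = None   # typed text, or a key name for "key"
    dy: Optional[int] = None     # scroll amount (scroll only)

    def __post_init__(self):
        if self.kind not in KINDS:
            raise ValueError(f"unknown action kind: {self.kind}")


def encode(action: GuiAction) -> str:
    """Serialize to a compact string a language-model head could emit."""
    fields = {k: v for k, v in asdict(action).items() if v is not None}
    return json.dumps(fields, separators=(",", ":"))


def decode(s: str) -> GuiAction:
    """Parse a model-emitted string back into a structured action."""
    return GuiAction(**json.loads(s))


# A robot-arm policy emits the same fixed-shape float vector every step.
robot_action = [0.12, -0.40, 0.03, 0.0, 0.0, 0.1, 1.0]

# A GUI episode mixes action types whose populated fields differ per step.
episode = [
    GuiAction("click", x=412, y=88),
    GuiAction("type", text="quarterly report"),
    GuiAction("key", text="Enter"),
    GuiAction("scroll", dy=-3),
]
encoded = [encode(a) for a in episode]
```

Each encoded step has a different set of fields and a different length, which is why a text-generating policy head is a more natural fit here than a fixed-width regression head.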
Foundation models and datasets
- OpenCUA (XLang AI) — open-weight 7B / 32B computer-use agents trained on the AgentNet dataset of >22k human demonstrations, with reflective long chain-of-thought reasoning
- OS-Atlas, UI-TARS, ShowUI, Magma, GUI-Owl — adjacent open VLA models for GUI grounding and agentic browser/desktop use
Datasets
- Khang-9966/Computer-Browser-Phone-Use-Agent-Datasets — curated index of datasets across the whole computer / browser / phone use-agent space; the first public dataset index for CUA browser control
Evaluation harnesses
- OSWorld — desktop computer-use agent benchmark
- Mind2Web — web-agent benchmark across hundreds of real websites
Labs
Browser-control GRPO (LFM2-VL-450M)
GRPO (Group Relative Policy Optimization) fine-tuning of a 450M-parameter VLM for browser control. Covers reward design, vLLM-colocated rollouts, and the full RL training loop.
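As a rough sketch of the two ideas this lab is built around, the snippet below scores a group of sampled click actions and converts the rewards into group-relative advantages, which is the core of GRPO. The reward shape (a small format bonus plus a hit bonus), the function names, and the use of population standard deviation are assumptions for illustration, not the lab's actual reward or training code.

```python
from statistics import mean, pstdev


def click_reward(pred, target_box, well_formed):
    """Hypothetical reward: 0.1 for parseable output, plus 1.0 if the
    predicted click lands inside the target element's bounding box."""
    if not well_formed:
        return 0.0
    x, y = pred
    left, top, right, bottom = target_box
    hit = left <= x <= right and top <= y <= bottom
    return 0.1 + (1.0 if hit else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts for the same prompt.

    GRPO uses the group mean as the baseline instead of a learned critic.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled rollouts for one screenshot and instruction.
target_box = (100, 40, 180, 70)  # left, top, right, bottom in pixels
rollouts = [
    ((120, 50), True),   # hit
    ((300, 50), True),   # well-formed but misses the element
    ((0, 0), False),     # unparseable output
    ((150, 60), True),   # hit
]
rewards = [click_reward(p, target_box, ok) for p, ok in rollouts]
advantages = group_relative_advantages(rewards)
```

Rollouts that hit the target get positive advantages and the rest get negative ones. If every rollout in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal, which is one reason reward design matters.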

