A computer-using agent (CUA) is the right framing when the demonstrations are screen recordings of a human using a desktop or web application. The observation is screen pixels, and the action is a mixed-modal, tokenized output (clicks at (x, y), key presses, typed text, scrolls) rather than the continuous low-DOF action vector you would emit for a robot arm. RPA-style automation, autonomous browser agents, and the CUA PDF Reader course project all fall here.

This is its own subfield with its own data formats and policy architectures. Classical imitation-learning libraries assume Box/Discrete action spaces and Gymnasium environments, so they cannot ingest screen frames or emit tokenized GUI action streams. The visuomotor robotics stack (Diffusion Policy, ACT, LeRobot) handles high-resolution image observations well, but its policy heads are designed for continuous, low-DOF, fixed-shape action vectors, not the discrete-categorical clicks-plus-coordinates-plus-text actions a GUI agent must produce. The CUA stack below is where to start.
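To make the difference from a fixed-shape action vector concrete, here is a minimal sketch of a mixed-modal GUI action space and its round trip through a text form that a language-model policy could emit. The class names, the `click(x=..., y=...)` string syntax, and the `serialize` / `parse` helpers are all hypothetical and chosen for illustration; they are not the format used by OpenCUA or any other model listed on this page.

```python
import ast
from dataclasses import asdict, dataclass
from typing import Union


# Each action type carries a different payload, so the action space is a
# tagged union rather than one fixed-length numeric vector.
@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class TypeText:
    text: str


@dataclass(frozen=True)
class KeyPress:
    key: str


@dataclass(frozen=True)
class Scroll:
    dx: int
    dy: int


GuiAction = Union[Click, TypeText, KeyPress, Scroll]

# Hypothetical action vocabulary: the name the policy writes -> action class.
_ACTIONS = {"click": Click, "type": TypeText, "key": KeyPress, "scroll": Scroll}
_NAMES = {cls: name for name, cls in _ACTIONS.items()}


def serialize(action: GuiAction) -> str:
    """Render an action as the text a policy would be trained to generate."""
    args = ", ".join(f"{k}={v!r}" for k, v in asdict(action).items())
    return f"{_NAMES[type(action)]}({args})"


def parse(text: str) -> GuiAction:
    """Parse generated text back into an action without evaluating any code."""
    node = ast.parse(text.strip(), mode="eval").body
    if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
        raise ValueError(f"not an action call: {text!r}")
    if node.args or node.func.id not in _ACTIONS:
        raise ValueError(f"unknown or malformed action: {text!r}")
    # literal_eval only accepts literals, so arbitrary expressions are rejected.
    # Field types are not validated here; a real pipeline would check them.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return _ACTIONS[node.func.id](**kwargs)


if __name__ == "__main__":
    episode = [Click(412, 96), TypeText("quarterly report"), KeyPress("Enter")]
    for step in episode:
        print(serialize(step))
```

A `Click` needs two integers, a `TypeText` needs an arbitrary-length string, and a `KeyPress` needs a symbol, so no single Box or Discrete space covers all three without awkward padding. Emitting the action as text sidesteps that mismatch.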

Foundation models and datasets

  • OpenCUA (XLang AI) — open-weight 7B / 32B computer-use agents trained on the AgentNet dataset of >22k human demonstrations, with reflective long chain-of-thought reasoning
  • OS-Atlas, UI-TARS, ShowUI, Magma, GUI-Owl — adjacent open VLA models for GUI grounding and agentic browser/desktop use

Evaluation harnesses

  • OSWorld — desktop computer-use agent benchmark
  • Mind2Web — web-agent benchmark across hundreds of real websites

Labs

Browser-control GRPO (LFM2-VL-450M)

GRPO fine-tuning of a 450M VLM for browser control. Reward design, vLLM-collocated rollouts, and the full RL training loop.
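As a rough illustration of the reward-design and training-loop pieces, the sketch below scores a group of sampled click actions for one prompt and converts the rewards into GRPO-style group-relative advantages (each reward minus the group mean, divided by the group standard deviation). The binary "click lands inside the target box" reward, the function names, and the use of the population standard deviation are assumptions made for this example; the lab's actual reward and normalization may differ.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def click_reward(x: int, y: int, target: BBox) -> float:
    """Hypothetical reward: 1.0 if the click lands inside the target box."""
    left, top, right, bottom = target
    return 1.0 if left <= x <= right and top <= y <= bottom else 0.0


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt.

    Uses the population standard deviation (an assumption of this sketch);
    eps keeps the division finite when every rollout gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    target = (100, 200, 180, 240)  # e.g. the bounding box of a "Submit" button
    sampled_clicks = [(120, 210), (90, 210), (150, 239), (300, 50)]
    rewards = [click_reward(x, y, target) for x, y in sampled_clicks]
    advantages = group_relative_advantages(rewards)
    for click, r, a in zip(sampled_clicks, rewards, advantages):
        print(click, r, round(a, 3))
```

Because advantages are computed relative to the other rollouts in the same group, no separate value network is needed. A group in which every rollout earns the same reward yields all-zero advantages and so contributes no gradient signal.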