Computer-Using Agents

A computer-using agent (CUA) is the right framing when the demonstrations are screen recordings of a human using a desktop or web application. The observation is screen pixels, and the action is a tokenized, mixed-modal output (clicks at (x, y), key presses, typed text, scrolls) rather than the continuous low-DOF action vector you would emit for a robot arm. RPA-style automation, autonomous browser agents, and the CUA PDF Reader course project all fall here.

This is its own subfield with its own data formats and policy architectures. Classical imitation-learning libraries assume Box/Discrete action spaces and Gymnasium environments, so they cannot ingest screen frames or emit tokenized GUI action streams. The visuomotor robotics stack (Diffusion Policy, ACT, LeRobot) handles high-resolution image observations well, but its policy heads are designed for continuous, low-DOF, fixed-shape action vectors, not the discrete, categorical click-plus-coordinates-plus-text actions a GUI agent must produce. The CUA stack below is where to start.
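To make the mixed-modal action space concrete, here is a minimal sketch of one way to represent a GUI action and serialize it to text a language-model head could emit. The `GuiAction` record, its field names, and the JSON encoding are illustrative assumptions, not the format used by any of the models or datasets listed on this page.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Hypothetical action vocabulary; real CUA datasets define their own.
ACTION_KINDS = {"click", "key", "type", "scroll"}


@dataclass(frozen=True)
class GuiAction:
    """One GUI step: a discrete kind plus whichever arguments that kind needs."""
    kind: str
    x: Optional[int] = None      # click: pixel column
    y: Optional[int] = None      # click: pixel row
    key: Optional[str] = None    # key: key name, e.g. "Enter"
    text: Optional[str] = None   # type: literal text to enter
    dy: Optional[int] = None     # scroll: vertical amount


def to_text(action: GuiAction) -> str:
    """Serialize to a compact string, dropping unused fields."""
    payload = {k: v for k, v in asdict(action).items() if v is not None}
    return json.dumps(payload, sort_keys=True)


def from_text(s: str) -> GuiAction:
    """Parse a serialized action back into a record, rejecting unknown kinds."""
    payload = json.loads(s)
    if payload.get("kind") not in ACTION_KINDS:
        raise ValueError(f"unknown action kind: {payload.get('kind')!r}")
    return GuiAction(**payload)


if __name__ == "__main__":
    episode = [
        GuiAction(kind="click", x=412, y=230),
        GuiAction(kind="type", text="annual report"),
        GuiAction(kind="key", key="Enter"),
        GuiAction(kind="scroll", dy=-300),
    ]
    for step in episode:
        print(to_text(step))
```

Unlike a fixed-shape `Box` vector, each step here carries a different set of arguments depending on its kind, which is why the policy output is treated as a token sequence rather than a regression target.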

Foundation models

OpenCUA (XLang AI) — open-weight 7B / 32B computer-use agents trained on the AgentNet dataset of >22k human demonstrations, with reflective long chain-of-thought reasoning
OS-Atlas, UI-TARS, ShowUI, Magma, GUI-Owl — adjacent open VLA models for GUI grounding and agentic browser/desktop use

Datasets

Khang-9966/Computer-Browser-Phone-Use-Agent-Datasets — curated index of datasets across the whole computer / browser / phone use-agent space; the first public dataset index for CUA browser control

Evaluation harnesses

OSWorld — desktop computer-use agent benchmark
Mind2Web — web-agent benchmark across hundreds of real websites

Labs

Browser-control GRPO (LFM2-VL-450M)

GRPO fine-tuning of a 450M VLM for browser control. Reward design, vLLM-collocated rollouts, and the full RL training loop.
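As a rough illustration of the idea at the center of GRPO, the sketch below computes group-relative advantages: several rollouts are sampled for the same prompt, each gets a scalar reward, and each reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of population standard deviation, and the example rewards are assumptions made for illustration; the lab's actual reward design and training loop are not reproduced here.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own sampling group.

    Rollouts that beat the group average get a positive advantage and
    those below it get a negative one; no learned value function is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four rollouts of the same browser task,
    # e.g. 1.0 if the agent clicked the correct element and 0.0 otherwise.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
```

Note that when every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal, which is one reason reward design matters for this kind of training.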
