Core components
A VLA agent integrates three capabilities:
- Vision — image or video encoders (Vision Transformers, CLIP) that produce spatial and semantic representations of the robot’s environment
- Language — transformer-based language encoders that interpret task instructions, goals, and constraints expressed in natural language
- Action — policy heads that map the joint vision-language embedding to robot actions: end-effector poses, joint velocities, navigation waypoints, or discrete manipulation primitives
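As a concrete illustration of the action component, the sketch below maps a fused vision-language embedding to a continuous 7-DoF end-effector action. All names, dimensions, and the tanh-squashed linear head are illustrative assumptions, not taken from any specific model.

```python
import numpy as np

# Hypothetical sketch of a VLA policy head: a fused vision-language
# embedding is mapped to a 7-DoF end-effector action
# (xyz position, roll-pitch-yaw rotation, gripper command).
rng = np.random.default_rng(0)

EMB_DIM = 512   # assumed size of the joint vision-language embedding
ACT_DIM = 7     # x, y, z, roll, pitch, yaw, gripper

W = rng.normal(scale=0.02, size=(EMB_DIM, ACT_DIM))  # linear policy head
b = np.zeros(ACT_DIM)

def policy_head(embedding: np.ndarray) -> np.ndarray:
    """Map a joint embedding to a continuous action, squashed to [-1, 1]."""
    return np.tanh(embedding @ W + b)

fused = rng.normal(size=EMB_DIM)   # stand-in for the encoder output
action = policy_head(fused)
```

Real systems replace the linear map with a learned head (autoregressive token decoding, diffusion, or flow matching, as in π₀), but the interface is the same: embedding in, action out.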
Key architectures
| Model | Publisher | Core idea |
|---|---|---|
| RT-1 | Google | Tokenize actions; train a Transformer on 130k robot demonstrations |
| RT-2 | Google DeepMind | Fine-tune a vision-language model (PaLI-X) end-to-end on robot data |
| SayCan | Google | Ground LLM task plans in robot skill affordances |
| OpenVLA | Stanford / Berkeley | Open-source VLA based on Llama 2 + DINOv2 |
| π₀ (Pi-zero) | Physical Intelligence | Flow-matching action head on a VLM backbone |
Pretraining and grounding
VLA models are typically built in two stages:
- Vision-language pretraining — the backbone (e.g., LLaVA, PaLI, InstructBLIP) is pretrained on internet-scale image-text data, giving it broad semantic and spatial understanding
- Robot action fine-tuning — the pretrained model is fine-tuned on robot demonstration datasets (Open X-Embodiment, BridgeData V2) to learn action prediction while retaining language grounding
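The action fine-tuning stage requires representing continuous robot actions in a form a Transformer can predict. A common recipe, used by RT-1's "tokenize actions" idea, is to discretize each action dimension into uniform bins. The sketch below shows one way this can work; the bin count and normalized action range are illustrative assumptions.

```python
import numpy as np

# Sketch of RT-1-style action tokenization: each continuous action
# dimension is discretized into 256 uniform bins so the model can
# predict actions as tokens. Range and bin count are illustrative.
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(action: np.ndarray) -> np.ndarray:
    """Continuous action -> integer tokens in [0, N_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)            # -> [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Integer tokens -> bin-center continuous action."""
    return LOW + (tokens + 0.5) * (HIGH - LOW) / N_BINS

a = np.array([0.3, -0.7, 0.0, 1.0, -1.0, 0.55, 0.9])  # example 7-DoF action
t = tokenize(a)
a_hat = detokenize(t)   # recovered action, accurate to within one bin width
```

Discretization lets action prediction reuse the same cross-entropy training objective and decoding machinery as language modeling, at the cost of quantization error bounded by the bin width.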