Core components
A VLA agent integrates three capabilities:
- Vision — image or video encoders (Vision Transformers, CLIP) that produce spatial and semantic representations of the robot’s environment
- Language — transformer-based language encoders that interpret task instructions, goals, and constraints expressed in natural language
- Action — policy heads that map the joint vision-language embedding to robot actions: end-effector poses, joint velocities, navigation waypoints, or discrete manipulation primitives
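As a concrete illustration of the action component, the sketch below maps a fused vision-language embedding to a continuous 7-DoF end-effector action. All names, dimensions, and the tanh-squashed linear head are illustrative assumptions, not taken from any specific model.

```python
import numpy as np

# Hypothetical sketch of a VLA policy head: a fused vision-language
# embedding is mapped to a 7-DoF end-effector action
# (xyz position, roll-pitch-yaw rotation, gripper command).
rng = np.random.default_rng(0)

EMB_DIM = 512   # assumed size of the joint vision-language embedding
ACT_DIM = 7     # x, y, z, roll, pitch, yaw, gripper

W = rng.normal(scale=0.02, size=(EMB_DIM, ACT_DIM))  # linear policy head
b = np.zeros(ACT_DIM)

def policy_head(embedding: np.ndarray) -> np.ndarray:
    """Map a joint embedding to a continuous action, squashed to [-1, 1]."""
    return np.tanh(embedding @ W + b)

fused = rng.normal(size=EMB_DIM)   # stand-in for the encoder output
action = policy_head(fused)
```

Real systems replace the linear map with a learned head (autoregressive token decoding, diffusion, or flow matching, as in π₀), but the interface is the same: embedding in, action out.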
Key architectures
| Model | Publisher | Core idea |
|---|---|---|
| RT-1 | Google | Tokenize actions; train a Transformer on 130k robot demonstrations |
| RT-2 | Google DeepMind | Fine-tune a vision-language model (PaLI-X) end-to-end on robot data |
| SayCan | Google | Ground LLM task plans in robot skill affordances |
| OpenVLA | Stanford / Berkeley | Open-source VLA based on Llama 2 + DINOv2 |
| π₀ (Pi-zero) | Physical Intelligence | Flow-matching action head on a VLM backbone |
Pretraining and grounding
VLA models are typically built in two stages:
- Vision-language pretraining — the backbone (e.g., LLaVA, PaLI, InstructBLIP) is pretrained on internet-scale image-text data, giving it broad semantic and spatial understanding
- Robot action fine-tuning — the pretrained model is fine-tuned on robot demonstration datasets (Open X-Embodiment, BridgeData V2) to learn action prediction while retaining language grounding
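The action fine-tuning stage requires representing continuous robot actions in a form a Transformer can predict. A common recipe, used by RT-1's "tokenize actions" idea, is to discretize each action dimension into uniform bins. The sketch below shows one way this can work; the bin count and normalized action range are illustrative assumptions.

```python
import numpy as np

# Sketch of RT-1-style action tokenization: each continuous action
# dimension is discretized into 256 uniform bins so the model can
# predict actions as tokens. Range and bin count are illustrative.
N_BINS = 256
LOW, HIGH = -1.0, 1.0  # assumed normalized action range

def tokenize(action: np.ndarray) -> np.ndarray:
    """Continuous action -> integer tokens in [0, N_BINS - 1]."""
    clipped = np.clip(action, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)            # -> [0, 1]
    return np.minimum((scaled * N_BINS).astype(int), N_BINS - 1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Integer tokens -> bin-center continuous action."""
    return LOW + (tokens + 0.5) * (HIGH - LOW) / N_BINS

a = np.array([0.3, -0.7, 0.0, 1.0, -1.0, 0.55, 0.9])  # example 7-DoF action
t = tokenize(a)
a_hat = detokenize(t)   # recovered action, accurate to within one bin width
```

Discretization lets action prediction reuse the same cross-entropy training objective and decoding machinery as language modeling, at the cost of quantization error bounded by the bin width.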