Its popularity stems from filling a critical gap. Before OpenVLA, the most capable VLA models, such as Google’s RT-2, were closed-source. Other open models were trained entirely on simulated data and did not generalize to real robots out of the box. OpenVLA gave the community the first powerful, fully open-source generalist manipulation policy.
Main points
Fully open-source and high capacity
OpenVLA is a 7-billion-parameter model built on a Prismatic VLM backbone, fusing a LLaMA 2 language model with DINOv2 and SigLIP vision encoders. The researchers publicly released:
- All pre-training and fine-tuning code
- Model weights
- Data mixtures used for training
Massive real-world pre-training
OpenVLA is trained on nearly 1 million real-world robot episodes from the Open X-Embodiment dataset, encompassing 27 different robotic datasets. This breadth allows it to control a variety of robots out of the box, including:
- WidowX
- Google Robot
- Franka Panda
State-of-the-art performance
When deployed out of the box, OpenVLA outperforms prior open-source models such as Octo and RT-1-X, and even performs comparably to or better than the 55-billion-parameter closed-source RT-2-X in most task categories, achieving a 16.5% higher absolute success rate on average. OpenVLA is particularly strong at:
- Language grounding, mapping instructions to the correct visual referents
- Multi-instruction tasks with distractor objects, staying on-task in cluttered scenes
Trained via next-token prediction
OpenVLA treats robotic control as a classification problem, exactly like a text-based LLM:
- The robot’s 7-dimensional continuous action space (end-effector position, rotation, gripper state) is discretized, with each dimension mapped to one of 256 uniform bins
- The model predicts these action bins as standard text tokens (reusing the 256 least-used tokens in the LLaMA tokenizer), trained with cross-entropy loss
- No architectural modification of the underlying VLM is required
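The discretization scheme above can be sketched in a few lines of plain Python. The bin count follows the paper; the per-dimension action bounds below are illustrative placeholders (in practice they come from dataset statistics):

```python
# Sketch of RT-2-style action tokenization as used by OpenVLA: each
# dimension of a 7-D continuous action maps to one of 256 uniform bins,
# so an action becomes a short sequence of discrete tokens.

N_BINS = 256

def discretize(action, low, high):
    """Map each continuous dimension to a bin index in [0, N_BINS - 1]."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        frac = (min(max(a, lo), hi) - lo) / (hi - lo)  # clamp, normalize to [0, 1]
        tokens.append(min(int(frac * N_BINS), N_BINS - 1))
    return tokens

def undiscretize(tokens, low, high):
    """Recover a continuous action from bin centers (inverse of discretize)."""
    return [lo + (t + 0.5) / N_BINS * (hi - lo)
            for t, lo, hi in zip(tokens, low, high)]

# Illustrative bounds for (x, y, z, roll, pitch, yaw, gripper) -- assumed values.
low  = [-0.05] * 6 + [0.0]
high = [ 0.05] * 6 + [1.0]

action = [0.01, -0.02, 0.0, 0.0, 0.0, 0.03, 1.0]
tokens = discretize(action, low, high)
recovered = undiscretize(tokens, low, high)
print(tokens)       # 7 bin indices, one per action dimension
print(recovered)    # close to the original action, within half a bin width
```

Because each bin index corresponds to a reserved token in the tokenizer's vocabulary, predicting an action reduces to ordinary next-token prediction over this sequence.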
Highly accessible for low-compute budgets
You do not need a server cluster to use OpenVLA:
- Parameter-efficient fine-tuning (PEFT) with LoRA matches full fine-tuning performance while training only 1.4% of the model’s parameters
- 4-bit quantization loads and runs the model on just 7 GB of GPU VRAM (down from 16 GB) with no observed performance degradation
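The 1.4% figure falls out of basic LoRA arithmetic: a frozen d×d weight matrix is adapted by two small rank-r factors, so only 2·d·r parameters train per layer. A minimal sketch, using an illustrative hidden size and rank rather than OpenVLA's exact configuration:

```python
# Back-of-the-envelope LoRA math: adapting weight W (d x d) with rank-r
# factors B (d x r) and A (r x d) trains 2*d*r parameters instead of d*d.
# The layer size and rank below are illustrative assumptions.

d, r = 4096, 32                  # hidden size and LoRA rank (assumed)
full      = d * d                # parameters in the frozen base weight
trainable = d * r + r * d       # parameters in the LoRA factors A and B

fraction = trainable / full
print(f"trainable fraction per adapted layer: {fraction:.2%}")
```

With these illustrative numbers the per-layer fraction is about 1.6%, the same ballpark as the 1.4% whole-model figure, since embeddings and other unadapted parameters stay frozen.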
Current limitations
Because OpenVLA is a large autoregressive model, it has several constraints in its current form:
- Single-frame inputs only: no temporal context across multiple frames
- Single-step action prediction: one action at a time, with no action chunking
- Inference speed: caps at roughly 3-9 Hz depending on hardware
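The single-step and inference-speed limits are linked: each 7-D action costs seven sequential token predictions, and the robot can only act as fast as one full prediction completes. A rough sketch, with an assumed per-token decode latency (the real latency varies with hardware):

```python
# Why autoregressive decoding caps the control rate: one action requires
# seven sequential token predictions (one per action dimension), and the
# controller waits for all of them. The per-token latency is an assumption.

tokens_per_action = 7            # one token per action dimension
ms_per_token = 20.0              # assumed decode latency per token
latency_s = tokens_per_action * ms_per_token / 1000.0
rate_hz = 1.0 / latency_s
print(f"control rate: {rate_hz:.1f} Hz")

# Action chunking (which OpenVLA currently lacks) would amortize this cost:
# predicting k actions per forward pass multiplies the effective rate by ~k.
```

With these assumed numbers the rate lands around 7 Hz, inside the 3-9 Hz range quoted above.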
Further reading
- OpenVLA project page (paper, code, models, demos)
- Kim et al. (2024), "OpenVLA: An Open-Source Vision-Language-Action Model"
- Open X-Embodiment dataset (the training corpus)
- Prismatic VLMs (the backbone architecture)

