Its popularity stems from filling a critical gap. Before OpenVLA, the most capable VLA models — such as Google’s RT-2 — were closed-source. Other open models were trained entirely on simulated data and did not generalize to real robots out of the box. OpenVLA gave the community the first powerful, fully open-source generalist manipulation policy.
Main points
Fully open-source and high capacity
OpenVLA is a 7-billion-parameter model built on a Prismatic VLM backbone, fusing LLaMA 2 with DINOv2 and SigLIP vision encoders. The researchers publicly released:
- All pre-training and fine-tuning code
- Model weights
- Data mixtures used for training
Massive real-world pre-training
OpenVLA is trained on nearly 1 million real-world robot episodes from the Open X-Embodiment dataset, encompassing 27 different robotic datasets. This breadth allows it to control multiple robot setups out of the box, including:
- WidowX arms (the BridgeData V2 setup)
- The Google Robot mobile manipulator
It can also be efficiently fine-tuned to new platforms such as the Franka Panda.
State-of-the-art performance
When deployed out of the box, OpenVLA outperforms prior open-source models such as Octo and RT-1-X, and even performs comparably to or better than the 55-billion-parameter closed-source RT-2-X in most task categories, beating it by 16.5% in absolute success rate on average. OpenVLA is particularly strong at:
- Language grounding — mapping instructions to the correct visual referents
- Multi-instruction tasks with distractor objects — staying on-task in cluttered scenes
Trained via next-token prediction
OpenVLA treats robotic control as a classification problem, exactly like a text-based LLM:
- The robot’s 7-dimensional continuous action space (position, rotation, gripper state) is discretized into 256 uniform bins per dimension
- The model predicts physical actions as standard text tokens, trained with cross-entropy loss
- No architectural modification of the underlying VLM is required
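The discretization step above can be sketched in a few lines. This is an illustrative reconstruction, not the official OpenVLA code: it assumes actions are normalized to [-1, 1] and uses 256 uniform bins per dimension, decoding each token back to its bin center.

```python
import numpy as np

N_BINS = 256  # one uniform bin per discrete action token

def discretize(action, low=-1.0, high=1.0):
    """Map a continuous 7-D action to integer bin indices (token ids)."""
    action = np.clip(action, low, high)
    idx = np.floor((action - low) / (high - low) * N_BINS).astype(int)
    return np.minimum(idx, N_BINS - 1)  # keep the upper edge in range

def undiscretize(idx, low=-1.0, high=1.0):
    """Recover the bin-center continuous value from token ids."""
    width = (high - low) / N_BINS
    return low + (idx + 0.5) * width

# x, y, z, roll, pitch, yaw, gripper
action = np.array([0.1, -0.4, 0.9, 0.0, 0.25, -0.8, 1.0])
tokens = discretize(action)
recovered = undiscretize(tokens)

# Round-trip error is bounded by half a bin width (~0.004 over [-1, 1])
assert np.all(np.abs(recovered - action) <= (2.0 / N_BINS) / 2 + 1e-9)
```

Because the targets are ordinary integer token ids, the VLM's existing next-token prediction machinery (and cross-entropy loss) applies unchanged.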
Highly accessible for low-compute budgets
You do not need a server cluster to use OpenVLA:
- Parameter-Efficient Fine-Tuning (PEFT) with LoRA — match full fine-tuning performance by training only 1.4% of the model’s parameters
- 4-bit quantization — load and run the model on just 7 GB of GPU VRAM (down from 16 GB) with no observed performance degradation
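To see where the "~1.4% of parameters" figure comes from, the back-of-envelope arithmetic below counts LoRA adapter parameters for a hypothetical LLaMA-2-7B-like backbone (hidden size 4096, 32 layers, MLP width 11008, rank r = 32 on every linear projection). The shapes and rank are assumptions for illustration; the exact OpenVLA count also includes adapters on the vision components.

```python
# LoRA replaces a frozen (d_in x d_out) weight update with two small
# matrices A (d_in x r) and B (r x d_out), so it adds r * (d_in + d_out)
# trainable parameters per adapted layer.
d = 4096        # assumed hidden size
layers = 32     # assumed transformer depth
mlp = 11008     # assumed MLP intermediate size
r = 32          # assumed LoRA rank

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

per_layer = (
    4 * lora_params(d, d, r)      # attention q, k, v, o projections
    + 2 * lora_params(d, mlp, r)  # MLP up and gate projections
    + lora_params(mlp, d, r)      # MLP down projection
)
trainable = layers * per_layer
total = 7_000_000_000
print(f"trainable LoRA params: {trainable/1e6:.0f}M "
      f"({100 * trainable / total:.2f}% of 7B)")
```

Under these assumed shapes the adapters come to roughly 80M parameters, on the order of 1% of the full model — the same ballpark as the 1.4% reported for OpenVLA.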
Current limitations
Because OpenVLA is a large autoregressive model, it has several constraints in its current form:
- Single-frame inputs only — no temporal context across multiple frames
- Single-step action prediction — predicts one action at a time, no action chunking
- Inference speed — control frequency tops out at roughly 3–9 Hz depending on hardware
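The last two limitations are linked: one action means autoregressively decoding one token per action dimension, so per-token latency directly bounds control frequency. The per-token latencies below are assumed round numbers for illustration, not measurements, but they show how a 3–9 Hz range falls out of seven-token decoding.

```python
# Illustrative only: per-token decode latencies are assumed, not measured.
TOKENS_PER_ACTION = 7  # one token per dimension of the 7-D action

def control_hz(ms_per_token):
    # Tokens are decoded sequentially, so one control step costs
    # TOKENS_PER_ACTION * (per-token latency).
    return 1000.0 / (TOKENS_PER_ACTION * ms_per_token)

for ms in (16, 30, 48):  # hypothetical per-token latencies in ms
    print(f"{ms} ms/token -> {control_hz(ms):.1f} Hz")
```

Action chunking (predicting several future actions per forward pass) would amortize this cost, which is why its absence is called out as a limitation.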
Where this fits in the course
OpenVLA is a concrete instance of the VLA architecture pattern: a pretrained vision-language model fine-tuned on robot demonstrations using behavior cloning. As discussed in the Physical AI overview, this puts it squarely in the imitation-learning family — and it inherits both the strengths (broad generalization from web-scale pretraining) and the weaknesses (distribution-shift fragility) you saw in the behavioral cloning tutorial.
Further reading
- OpenVLA project page — paper, code, models, demos
- Kim et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model
- Open X-Embodiment dataset — the training corpus
- Prismatic VLMs — the backbone architecture

