OpenVLA is the most influential open-source Vision-Language-Action model released to date. When it was first presented to the community, the project prompted a “record number of questions by far for any of the robot paper discussions” hosted by Hugging Face — a measure of how much excitement it generated among researchers.
Its popularity stems from filling a critical gap. Before OpenVLA, the most capable VLA models — such as Google’s RT-2 — were closed-source. Other open models were trained entirely on simulated data and did not generalize to real robots out of the box. OpenVLA gave the community the first powerful, fully open-source generalist manipulation policy.

Main points

Fully open-source and high capacity

OpenVLA is a 7-billion parameter model built on a Prismatic VLM backbone, fusing LLaMA 2 with DINOv2 and SigLIP vision encoders. The researchers publicly released:
  • All pre-training and fine-tuning code
  • Model weights
  • Data mixtures used for training
This level of openness — code, weights, and data — is what makes OpenVLA reproducible and extensible by the broader research community.

Massive real-world pre-training

OpenVLA is trained on nearly 1 million real-world robot episodes from the Open X-Embodiment dataset, encompassing 27 different robotic datasets. This breadth allows it to control a variety of robots out of the box, including:
  • WidowX
  • Google Robot
  • Franka Panda
The use of real (not simulated) demonstration data is a major reason it transfers well to physical hardware.
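A cross-embodiment mixture like this only works if actions recorded on very different robots are brought to a common scale before tokenization. OpenVLA normalizes each action dimension to [-1, 1] using per-dataset 1st/99th-percentile statistics, which is also why deployment requires knowing which dataset's statistics to invert with. A minimal NumPy sketch of that idea — the function names and the synthetic datasets below are illustrative, not the repo's API:

```python
import numpy as np

def fit_action_stats(actions):
    """Per-dimension 1st/99th-percentile bounds for one dataset.

    actions: (N, 7) array of raw continuous actions
    (dx, dy, dz, droll, dpitch, dyaw, gripper)."""
    return {"q01": np.percentile(actions, 1, axis=0),
            "q99": np.percentile(actions, 99, axis=0)}

def normalize(action, stats):
    """Map a raw action into [-1, 1] using its dataset's bounds."""
    scaled = 2.0 * (action - stats["q01"]) / (stats["q99"] - stats["q01"]) - 1.0
    return np.clip(scaled, -1.0, 1.0)

def unnormalize(action, stats):
    """Invert the mapping at deployment time (needs the matching dataset key)."""
    return (action + 1.0) / 2.0 * (stats["q99"] - stats["q01"]) + stats["q01"]

# Two hypothetical datasets with very different action magnitudes
rng = np.random.default_rng(0)
widowx_stats = fit_action_stats(rng.normal(0.0, 0.02, size=(1000, 7)))
franka_stats = fit_action_stats(rng.normal(0.0, 0.20, size=(1000, 7)))

# The midpoint of the bounds always normalizes to exactly 0,
# and normalize/unnormalize round-trip for in-range actions
mid = (widowx_stats["q01"] + widowx_stats["q99"]) / 2
assert np.allclose(normalize(mid, widowx_stats), 0.0)
assert np.allclose(unnormalize(normalize(mid, widowx_stats), widowx_stats), mid)
```

Using percentile bounds rather than min/max makes the normalization robust to outlier actions in the demonstration data.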

State-of-the-art performance

When deployed out of the box, OpenVLA outperforms prior open-source models such as Octo and RT-1-X, and even matches or exceeds the 55-billion-parameter closed-source RT-2-X in most task categories, achieving a 16.5% higher absolute success rate on average. OpenVLA is particularly strong at:
  • Language grounding — mapping instructions to the correct visual referents
  • Multi-instruction tasks with distractor objects — staying on-task in cluttered scenes

Trained via next-token prediction

OpenVLA treats robotic control as a classification problem, exactly like a text-based LLM:
  1. Each dimension of the robot’s 7-dimensional continuous action space (position, rotation, gripper state) is discretized into 256 uniform bins
  2. The model predicts physical actions as standard text tokens using cross-entropy loss
  3. No architectural modification of the underlying VLM is required
This is the same training paradigm as a language model — the only change is that some of the “tokens” in the vocabulary now represent robot actions.
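The steps above can be sketched in a few lines. OpenVLA overwrites the 256 least-used tokens in the Llama tokenizer with action bins; the sketch below just reserves the top of the vocabulary id range for illustration, so `ACTION_TOKEN_START` and the helper names are assumptions, not the repo's actual mapping:

```python
import numpy as np

N_BINS = 256                # one bin id per discretized action value
VOCAB_SIZE = 32000          # Llama-2 vocabulary size
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS  # illustrative: last 256 ids

# Uniform bins over the [-1, 1] normalized action range
bin_edges = np.linspace(-1.0, 1.0, N_BINS + 1)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

def action_to_tokens(action):
    """Discretize a 7-D normalized action into 7 token ids."""
    bins = np.clip(np.digitize(action, bin_edges) - 1, 0, N_BINS - 1)
    return ACTION_TOKEN_START + bins

def tokens_to_action(tokens):
    """Decode predicted token ids back to (approximate) continuous actions."""
    return bin_centers[tokens - ACTION_TOKEN_START]

a = np.array([0.1, -0.3, 0.0, 0.5, -1.0, 1.0, 0.9])
toks = action_to_tokens(a)
decoded = tokens_to_action(toks)
# Round-tripping loses at most half a bin width (1/256) per dimension
assert np.all(np.abs(decoded - a) <= 1.0 / N_BINS + 1e-9)
```

Because the actions are just token ids, the model is trained with the ordinary next-token cross-entropy loss; the discretization error (half a bin width) is the price paid for reusing the LLM machinery unchanged.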

Highly accessible for low-compute budgets

You do not need a server cluster to use OpenVLA:
  • Parameter-Efficient Fine-Tuning (PEFT) with LoRA — match full fine-tuning performance by training only 1.4% of the model’s parameters
  • 4-bit quantization — load and run the model on just 7 GB of GPU VRAM (down from 16 GB) with no observed performance degradation
Together, these make OpenVLA accessible on consumer-grade GPUs, which is unusual for a 7B-parameter foundation model.
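The back-of-envelope arithmetic behind those numbers is worth seeing once. Assuming roughly 7.5B parameters for the full model (LLaMA-2 7B plus the vision encoders — an assumed figure), weights-only footprints come out close to the reported totals; the measured 16 GB / 7 GB figures also include activations and any layers kept at higher precision:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weights-only footprint; real usage adds activations, KV cache,
    and any modules not quantized."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.5e9       # assumed parameter count for backbone + encoders
LORA_FRACTION = 0.014  # 1.4% of parameters trained under LoRA

bf16_gb = weight_memory_gb(N_PARAMS, 16)       # ~15 GB of weights
int4_gb = weight_memory_gb(N_PARAMS, 4)        # ~3.75 GB of weights
lora_params_m = N_PARAMS * LORA_FRACTION / 1e6 # ~105 M trainable parameters

print(f"bf16 weights: {bf16_gb:.1f} GB, int4 weights: {int4_gb:.2f} GB, "
      f"LoRA trainable: {lora_params_m:.0f} M params")
```

Training ~105 M parameters instead of 7.5 B is what shrinks optimizer state and gradient memory enough to fine-tune on a single consumer GPU.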

Current limitations

Because OpenVLA is a large autoregressive model, it has several constraints in its current form:
  • Single-frame inputs only — no temporal context across multiple frames
  • Single-step action prediction — predicts one action at a time, no action chunking
  • Inference speed — roughly 3–9 Hz depending on hardware
These limitations make OpenVLA currently unsuitable for high-frequency control tasks or complex bimanual manipulation without further optimizations such as action chunking, distillation, or more efficient inference backends.
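To see why action chunking is the usual remedy: if the policy can only be queried at ~6 Hz but each query returns a chunk of k future actions, the robot can execute actions at k times the query rate. A generic sketch of a chunked control loop — the `policy` here is a toy stand-in, not OpenVLA's interface:

```python
from collections import deque

def chunked_control_loop(policy, get_obs, execute, n_steps):
    """Run n_steps of control, re-querying the policy only when the
    current chunk of actions is exhausted. Returns the number of
    (expensive) policy calls made."""
    queue = deque()
    calls = 0
    for _ in range(n_steps):
        if not queue:                     # chunk exhausted: query the model
            queue.extend(policy(get_obs()))
            calls += 1
        execute(queue.popleft())          # cheap per-step execution
    return calls

# Toy stand-in: a "policy" that returns a chunk of 8 no-op actions
calls = chunked_control_loop(
    policy=lambda obs: [[0.0] * 7 for _ in range(8)],
    get_obs=lambda: None,
    execute=lambda a: None,
    n_steps=64,
)
# 64 control steps required only 64 / 8 = 8 policy calls
```

The trade-off is open-loop drift: actions later in a chunk are executed against stale observations, which is exactly the distribution-shift fragility the behavioral cloning tutorial highlighted.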

Where this fits in the course

OpenVLA is a concrete instance of the VLA architecture pattern: a pretrained vision-language model fine-tuned on robot demonstrations using behavior cloning. As discussed in the Physical AI overview, this puts it squarely in the imitation-learning family — and it inherits both the strengths (broad generalization from web-scale pretraining) and the weaknesses (distribution-shift fragility) you saw in the behavioral cloning tutorial.

Further reading