OpenVLA is the most influential open-source Vision-Language-Action model released to date. When it was first presented to the community, the project prompted a “record number of questions by far for any of the robot paper discussions” hosted by Hugging Face — a measure of how much excitement it generated among researchers.
Its popularity stems from filling a critical gap. Before OpenVLA, the most capable VLA models — such as Google’s RT-2 — were closed-source. Other open models were trained entirely on simulated data and did not generalize to real robots out of the box. OpenVLA gave the community the first powerful, fully open-source generalist manipulation policy.

Main points

Fully open-source and high capacity

OpenVLA is a 7-billion parameter model built on a Prismatic VLM backbone, fusing LLaMA 2 with DINOv2 and SigLIP vision encoders. The researchers publicly released:
  • All pre-training and fine-tuning code
  • Model weights
  • Data mixtures used for training
This level of openness — code, weights, and data — is what makes OpenVLA reproducible and extensible by the broader research community.

Massive real-world pre-training

OpenVLA is trained on nearly 1 million real-world robot episodes from the Open X-Embodiment dataset, encompassing 27 different robotic datasets. This breadth allows it to control a variety of robots out of the box, including:
  • WidowX
  • Google Robot
  • Franka Panda
The use of real (not simulated) demonstration data is a major reason it transfers well to physical hardware.
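A cross-embodiment mixture like this only works if actions recorded on very different robots are brought to a common scale before tokenization. OpenVLA normalizes each action dimension to [-1, 1] using per-dataset 1st/99th-percentile statistics, which is also why deployment requires knowing which dataset's statistics to invert with. A minimal NumPy sketch of that idea — the function names and the synthetic datasets below are illustrative, not the repo's API:

```python
import numpy as np

def fit_action_stats(actions):
    """Per-dimension 1st/99th-percentile bounds for one dataset.

    actions: (N, 7) array of raw continuous actions
    (dx, dy, dz, droll, dpitch, dyaw, gripper)."""
    return {"q01": np.percentile(actions, 1, axis=0),
            "q99": np.percentile(actions, 99, axis=0)}

def normalize(action, stats):
    """Map a raw action into [-1, 1] using its dataset's bounds."""
    scaled = 2.0 * (action - stats["q01"]) / (stats["q99"] - stats["q01"]) - 1.0
    return np.clip(scaled, -1.0, 1.0)

def unnormalize(action, stats):
    """Invert the mapping at deployment time (needs the matching dataset key)."""
    return (action + 1.0) / 2.0 * (stats["q99"] - stats["q01"]) + stats["q01"]

# Two hypothetical datasets with very different action magnitudes
rng = np.random.default_rng(0)
widowx_stats = fit_action_stats(rng.normal(0.0, 0.02, size=(1000, 7)))
franka_stats = fit_action_stats(rng.normal(0.0, 0.20, size=(1000, 7)))

# The midpoint of the bounds always normalizes to exactly 0,
# and normalize/unnormalize round-trip for in-range actions
mid = (widowx_stats["q01"] + widowx_stats["q99"]) / 2
assert np.allclose(normalize(mid, widowx_stats), 0.0)
assert np.allclose(unnormalize(normalize(mid, widowx_stats), widowx_stats), mid)
```

Using percentile bounds rather than min/max makes the normalization robust to outlier actions in the demonstration data.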

State-of-the-art performance

When deployed out of the box, OpenVLA outperforms prior open-source models such as Octo and RT-1-X, and even matches or exceeds the 55-billion-parameter closed-source RT-2-X in most task categories, achieving a 16.5% higher absolute success rate on average. OpenVLA is particularly strong at:
  • Language grounding — mapping instructions to the correct visual referents
  • Multi-instruction tasks with distractor objects — staying on-task in cluttered scenes

Trained via next-token prediction

OpenVLA treats robotic control as a classification problem, exactly like a text-based LLM:
  1. Each dimension of the robot’s 7-dimensional continuous action space (position, rotation, gripper state) is discretized into 256 uniform bins
  2. The model predicts physical actions as standard text tokens using cross-entropy loss
  3. No architectural modification of the underlying VLM is required
This is the same training paradigm as a language model — the only change is that some of the “tokens” in the vocabulary now represent robot actions.
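The steps above can be sketched in a few lines. OpenVLA overwrites the 256 least-used tokens in the Llama tokenizer with action bins; the sketch below just reserves the top of the vocabulary id range for illustration, so `ACTION_TOKEN_START` and the helper names are assumptions, not the repo's actual mapping:

```python
import numpy as np

N_BINS = 256                # one bin id per discretized action value
VOCAB_SIZE = 32000          # Llama-2 vocabulary size
ACTION_TOKEN_START = VOCAB_SIZE - N_BINS  # illustrative: last 256 ids

# Uniform bins over the [-1, 1] normalized action range
bin_edges = np.linspace(-1.0, 1.0, N_BINS + 1)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

def action_to_tokens(action):
    """Discretize a 7-D normalized action into 7 token ids."""
    bins = np.clip(np.digitize(action, bin_edges) - 1, 0, N_BINS - 1)
    return ACTION_TOKEN_START + bins

def tokens_to_action(tokens):
    """Decode predicted token ids back to (approximate) continuous actions."""
    return bin_centers[tokens - ACTION_TOKEN_START]

a = np.array([0.1, -0.3, 0.0, 0.5, -1.0, 1.0, 0.9])
toks = action_to_tokens(a)
decoded = tokens_to_action(toks)
# Round-tripping loses at most half a bin width (1/256) per dimension
assert np.all(np.abs(decoded - a) <= 1.0 / N_BINS + 1e-9)
```

Because the actions are just token ids, the model is trained with the ordinary next-token cross-entropy loss; the discretization error (half a bin width) is the price paid for reusing the LLM machinery unchanged.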

Highly accessible for low-compute budgets

You do not need a server cluster to use OpenVLA:
  • Parameter-Efficient Fine-Tuning (PEFT) with LoRA — match full fine-tuning performance by training only 1.4% of the model’s parameters
  • 4-bit quantization — load and run the model on just 7 GB of GPU VRAM (down from 16 GB) with no observed performance degradation
Together, these make OpenVLA accessible on consumer-grade GPUs, which is unusual for a 7B-parameter foundation model.
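The back-of-envelope arithmetic behind those numbers is worth seeing once. Assuming roughly 7.5B parameters for the full model (LLaMA-2 7B plus the vision encoders — an assumed figure), weights-only footprints come out close to the reported totals; the measured 16 GB / 7 GB figures also include activations and any layers kept at higher precision:

```python
def weight_memory_gb(n_params, bits_per_param):
    """Weights-only footprint; real usage adds activations, KV cache,
    and any modules not quantized."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7.5e9       # assumed parameter count for backbone + encoders
LORA_FRACTION = 0.014  # 1.4% of parameters trained under LoRA

bf16_gb = weight_memory_gb(N_PARAMS, 16)       # ~15 GB of weights
int4_gb = weight_memory_gb(N_PARAMS, 4)        # ~3.75 GB of weights
lora_params_m = N_PARAMS * LORA_FRACTION / 1e6 # ~105 M trainable parameters

print(f"bf16 weights: {bf16_gb:.1f} GB, int4 weights: {int4_gb:.2f} GB, "
      f"LoRA trainable: {lora_params_m:.0f} M params")
```

Training ~105 M parameters instead of 7.5 B is what shrinks optimizer state and gradient memory enough to fine-tune on a single consumer GPU.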

Current limitations

Because OpenVLA is a large autoregressive model, it has several constraints in its current form:
  • Single-frame inputs only — no temporal context across multiple frames
  • Single-step action prediction — predicts one action at a time, no action chunking
  • Inference speed — roughly 3–9 Hz depending on hardware
These limitations make OpenVLA currently unsuitable for high-frequency control tasks or complex bimanual manipulation without further optimizations such as action chunking, distillation, or more efficient inference backends.
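To see why action chunking is the usual remedy: if the policy can only be queried at ~6 Hz but each query returns a chunk of k future actions, the robot can execute actions at k times the query rate. A generic sketch of a chunked control loop — the `policy` here is a toy stand-in, not OpenVLA's interface:

```python
from collections import deque

def chunked_control_loop(policy, get_obs, execute, n_steps):
    """Run n_steps of control, re-querying the policy only when the
    current chunk of actions is exhausted. Returns the number of
    (expensive) policy calls made."""
    queue = deque()
    calls = 0
    for _ in range(n_steps):
        if not queue:                     # chunk exhausted: query the model
            queue.extend(policy(get_obs()))
            calls += 1
        execute(queue.popleft())          # cheap per-step execution
    return calls

# Toy stand-in: a "policy" that returns a chunk of 8 no-op actions
calls = chunked_control_loop(
    policy=lambda obs: [[0.0] * 7 for _ in range(8)],
    get_obs=lambda: None,
    execute=lambda a: None,
    n_steps=64,
)
# 64 control steps required only 64 / 8 = 8 policy calls
```

The trade-off is open-loop drift: actions later in a chunk are executed against stale observations, which is exactly the distribution-shift fragility the behavioral cloning tutorial highlighted.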

Where this fits in the course

OpenVLA is a concrete instance of the VLA architecture pattern: a pretrained vision-language model fine-tuned on robot demonstrations using behavior cloning. As discussed in the Physical AI overview, this puts it squarely in the imitation-learning family — and it inherits both the strengths (broad generalization from web-scale pretraining) and the weaknesses (distribution-shift fragility) you saw in the behavioral cloning tutorial.

Further reading