Humans build an internal mental model of the world and routinely use simulation to update their expectations. When you catch a ball, you are not reacting to each frame of visual input — you are predicting the ball’s trajectory in a learned internal model and planning your arm’s motion against that prediction. World models bring this idea to machine learning: instead of learning a policy directly from environment interactions (model-free RL), the agent first learns a compressed, predictive model of the environment and then trains a controller inside that model. The agent literally learns to act inside its own dream.

What a world model is

At its most general, a world model is any learned function
(observation history, action) → next observation (or distribution over next observations)
that the agent can roll out forward in time without touching the real environment. Once you have such a function, you can:
  • Plan by searching over hypothetical action sequences
  • Train a policy entirely on imagined rollouts (no further real-environment samples needed once the model is trained)
  • Estimate uncertainty by sampling multiple futures from the model
This is the deep-learning incarnation of model-based reinforcement learning. Classical model-based RL either assumed a known transition function (as in MPC) or learned one in simple tabular or linear form (as in Dyna); modern world models learn the dynamics directly from pixels with deep networks.
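The rollout-and-plan loop above can be sketched in a few lines. This is a toy illustration, not any particular system's implementation: the linear dynamics `A`, `B` stand in for a learned model, the quadratic `reward` is invented for the example, and the planner is simple random shooting over hypothetical action sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a *learned* transition model (illustrative only).
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # learned state-transition matrix
B = np.array([[0.0], [0.1]])             # learned effect of the action

def world_model(state, action):
    """One imagined step: predict the next state without touching the real env."""
    return A @ state + B @ action

def reward(state):
    """Imagined reward for the toy task: stay close to the origin."""
    return -np.sum(state ** 2)

def plan_random_shooting(state, horizon=10, n_candidates=256):
    """Plan by searching over hypothetical action sequences inside the model."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 1))
        s, total = state.copy(), 0.0
        for a in actions:                # imagined rollout, no env interaction
            s = world_model(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action             # execute only the first action (MPC-style)

a0 = plan_random_shooting(np.array([1.0, 0.0]))
```

Sampling several rollouts per action sequence (rather than one) would also give the uncertainty estimate mentioned above, if the model were stochastic.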

A family of architectures

World models have evolved through several generative-modeling backbones. Each generation kept the core idea — learn an internal simulator, train a controller inside it — but improved the fidelity and scope of the dream.
| Generation | Vision backbone | Example | Strength |
|---|---|---|---|
| Latent VAE + RNN | VAE | Ha & Schmidhuber (2018) | Clean decomposition, easy to train, low compute |
| Recurrent latent dynamics | VAE | Dreamer / DreamerV3 | Single architecture works across Atari, DMControl, Minecraft |
| Diffusion-based | Diffusion model | DIAMOND, Genie 2 | Photorealistic next-frame prediction; the dream looks like reality |
| Autoregressive token | Transformer | GAIA-1 | Multi-modal conditioning, supports text-conditioned scenario generation |
The original VAE-based architecture has blurry reconstructions and a fixed latent dimensionality, but it remains the cleanest place to build intuition before tackling the larger variants. Diffusion and autoregressive models achieve dramatically higher visual fidelity at the cost of much larger parameter counts and slower rollouts.
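To build that intuition, here is a minimal sketch of the VAE + RNN rollout structure: encode one observation into a latent, then step a recurrent dynamics model forward in latent space, decoding to pixels only when needed. The random weight matrices are placeholders for trained networks, and all dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACT_DIM, HIDDEN = 64, 8, 2, 16

# Random weights standing in for *trained* networks (VAE encoder/decoder, RNN).
W_enc = rng.normal(0, 0.1, (LATENT_DIM, OBS_DIM))                  # encoder (mean only)
W_dec = rng.normal(0, 0.1, (OBS_DIM, LATENT_DIM))                  # decoder
W_h   = rng.normal(0, 0.1, (HIDDEN, HIDDEN + LATENT_DIM + ACT_DIM))  # recurrent cell
W_z   = rng.normal(0, 0.1, (LATENT_DIM, HIDDEN))                   # hidden -> next latent

def encode(obs):
    """Compress a high-dimensional observation into a small latent code."""
    return W_enc @ obs

def dream_rollout(obs0, actions):
    """Roll the latent dynamics forward; no decoding inside the loop."""
    z, h = encode(obs0), np.zeros(HIDDEN)
    latents = []
    for a in actions:
        h = np.tanh(W_h @ np.concatenate([h, z, a]))  # recurrent update
        z = W_z @ h                                   # predicted next latent
        latents.append(z)
    return np.stack(latents)

obs0 = rng.normal(size=OBS_DIM)
zs = dream_rollout(obs0, rng.uniform(-1.0, 1.0, size=(5, ACT_DIM)))
recon = W_dec @ zs[-1]   # decode a frame only when pixels are actually needed
```

The key design choice this illustrates is that the controller and planner live in the cheap latent space; the expensive decoder is only used for visualization or debugging.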

World models vs other approaches

It helps to put world models next to the alternatives you have already seen for learning a control policy. Each approach has a different failure mode:
| Approach | How it learns | Failure mode |
|---|---|---|
| Behavioral cloning | Supervised regression on expert demonstrations (where the expert is itself produced by an RL algorithm such as PPO, or by human teleoperation) | Distribution shift: errors compound on unseen states |
| World model | Learn a model of the environment, then train a controller in the dream | Model inaccuracy: the dream may diverge from reality |

How world models connect to the rest of Physical AI

World models sit at the intersection of several threads in this course.

Sim-to-real. A world model trained on real data is a simulator, one that is automatically calibrated to reality because it was learned from real observations. This sidesteps the hand-authored sim gap discussed in the sim-to-real transfer page: instead of building Gazebo worlds and hoping they match reality, you learn a world model from a few minutes of real video and train inside it.

Behavioral cloning vs world models. BC treats the expert's demonstrations as the only source of data. World models treat the environment's dynamics as learnable, so the agent can extrapolate beyond the demonstrations by imagining what would happen in unseen states. This is why world models partially address BC's distribution-shift problem: the agent has dreamed about a wider distribution of states than any fixed demonstration dataset provides.

VLA models. Most current VLA architectures, such as OpenVLA and RT-2, are pure behavior-cloning systems with no explicit world model. Combining them with learned dynamics models is an active research direction, precisely because doing so could close the BC-vs-world-model gap above.
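"Training a controller in the dream" can be made concrete with a toy black-box search over imagined returns. This sketch uses a simple finite-difference evolution strategy and invented linear dynamics as the learned model; Ha & Schmidhuber's original work used CMA-ES, but the structure is the same: every policy evaluation happens inside the model, with zero real-environment samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear dynamics standing in for a trained world model (illustrative only).
A = np.array([[1.0, 0.05], [0.0, 1.0]])
B = np.array([[0.0], [0.05]])

def imagined_return(policy_w, horizon=20):
    """Evaluate a linear controller entirely inside the model (the 'dream')."""
    s, total = np.array([1.0, 0.0]), 0.0
    for _ in range(horizon):
        a = np.clip(policy_w @ s, -1.0, 1.0)  # linear controller, bounded action
        s = A @ s + B @ a                     # imagined transition
        total += -np.sum(s ** 2)              # imagined reward: stay near origin
    return total

# Simple evolution strategy: perturb the policy, score each perturbation in
# the dream, and move toward perturbations that dreamed up higher returns.
SIGMA, LR, POP = 0.1, 0.01, 16
w = np.zeros((1, 2))
for _ in range(100):
    noise = rng.normal(0, SIGMA, size=(POP, 1, 2))
    returns = np.array([imagined_return(w + n) for n in noise])
    grad_est = np.tensordot(returns - returns.mean(), noise, axes=1) / (POP * SIGMA)
    w = w + LR * grad_est
```

Whether the resulting controller works in reality depends entirely on model accuracy, which is exactly the failure mode listed in the comparison table above.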

In this section

The Seminal Model

Ha & Schmidhuber’s original V/M/C architecture — VAE for vision, MDN-RNN for memory, tiny linear controller trained inside the dream.

Further reading