What a world model is
At its most general, a world model is any learned function that predicts how the environment will evolve in response to the agent's actions. Once you have such a function, you can:
- Plan by searching over hypothetical action sequences
- Train a policy entirely on imagined rollouts (no real-world samples needed)
- Estimate uncertainty by sampling multiple futures from the model
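The first use case, planning by search, can be sketched with a simple random-shooting planner. Everything here is illustrative: the `dynamics` and `reward` functions are toy stand-ins for a learned model, not any particular system's API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a learned dynamics model: predicts the next state from the
# current state and an action. In practice this is a trained neural network;
# here it is a fixed linear map so the sketch runs on its own.
def dynamics(state, action):
    return 0.9 * state + 0.1 * action

# Stand-in reward: stay close to the origin.
def reward(state):
    return -np.sum(state ** 2)

def plan_random_shooting(state, horizon=10, n_candidates=100):
    """Search over hypothetical action sequences inside the model."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1, 1, size=(horizon, state.shape[0]))
        s, total = state, 0.0
        for a in actions:  # imagined rollout -- no real environment steps
            s = dynamics(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action  # execute only the first action (MPC style)

first_action = plan_random_shooting(np.array([1.0, -0.5]))
```

Real planners replace the random candidate sampling with something smarter (CEM, gradient-based trajectory optimization), but the core loop of imagining rollouts inside the model and scoring them is the same.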
A family of architectures
World models have evolved through several generative-modeling backbones. Each generation kept the core idea (learn an internal simulator, train a controller inside it) while improving the fidelity and scope of the dream.

| Generation | Vision backbone | Example | Strength |
|---|---|---|---|
| Latent VAE + RNN | VAE | Ha & Schmidhuber (2018) | Clean decomposition, easy to train, low compute |
| Recurrent latent dynamics | VAE | Dreamer / DreamerV3 | Single architecture works across Atari, DMControl, Minecraft |
| Diffusion-based | Diffusion model | DIAMOND, Genie 2 | Photorealistic next-frame prediction, the dream looks like reality |
| Autoregressive token | Transformer | GAIA-1 | Multi-modal conditioning, supports text-conditioned scenario generation |
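The "train a controller inside the simulator" idea common to all of these generations can be sketched as black-box search against imagined rollouts only. This is a toy simplification: the dynamics model is a fixed linear map standing in for a trained recurrent model, and the search is plain random search rather than the CMA-ES used by Ha & Schmidhuber.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a learned dynamics model (in practice an RNN or
# transformer trained on real experience).
def dynamics(state, action):
    return 0.9 * state + 0.1 * action

def imagined_return(params, horizon=20):
    """Evaluate a linear controller using only model rollouts."""
    s, total = np.array([1.0, -1.0]), 0.0
    W = params.reshape(2, 2)
    for _ in range(horizon):
        a = np.tanh(W @ s)        # controller: action from state
        s = dynamics(s, a)        # imagined step, no real samples used
        total += -np.sum(s ** 2)  # toy reward: stay near the origin
    return total

# Random search in the dream: perturb the controller, keep improvements.
best_params = np.zeros(4)
best_ret = imagined_return(best_params)
for _ in range(200):
    cand = best_params + 0.1 * rng.standard_normal(4)
    r = imagined_return(cand)
    if r > best_ret:
        best_params, best_ret = cand, r
```

Because every evaluation happens inside the model, the controller consumes zero real-world samples during training; the real environment is only needed to collect data for fitting the model in the first place.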
World models vs other approaches
It helps to put world models next to the alternatives you have already seen for learning a control policy. Each approach has a different failure mode:

| Approach | How it learns | Failure mode |
|---|---|---|
| Behavioral cloning | Supervised regression on expert demonstrations (where the expert is itself produced by an RL algorithm such as PPO, or by human teleoperation) | Distribution shift — compounds errors on unseen states |
| World model | Learn a model of the environment, then train a controller in the dream | Model inaccuracy — the dream may diverge from reality |
How world models connect to the rest of Physical AI
World models sit at the intersection of several threads in this course:

Sim-to-real. A world model trained on real data is a simulator, one that is automatically calibrated to reality because it was learned from real observations. This eliminates the hand-authored sim gap discussed in the sim-to-real transfer page. Instead of building Gazebo worlds and hoping they match reality, you learn a world model from a few minutes of real video and train inside it.

Behavioral cloning vs world models. BC treats the expert's demonstrations as the only source of data. World models treat the environment's dynamics as learnable: the agent can extrapolate beyond the demonstrations by imagining what would happen in unseen states. This is why world models partially address BC's distribution-shift problem: the agent has dreamed about a wider distribution of states than any fixed demonstration dataset provides.

VLA models. Most current VLA architectures, such as OpenVLA and RT-2, are pure behavior-cloning systems with no explicit world model. Combining them with learned dynamics models is an active research direction precisely because doing so could close the BC-vs-world-model gap above.

In this section
The Seminal Model
Ha & Schmidhuber’s original V/M/C architecture — VAE for vision, MDN-RNN for memory, tiny linear controller trained inside the dream.
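The data flow through V, M, and C can be sketched in a few lines. The shapes and linear maps below are illustrative placeholders, not the paper's architecture: the real V is a convolutional VAE, M is an MDN-RNN, and only C's smallness is faithful to the original.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative dimensions (not the paper's).
OBS, LATENT, HIDDEN, ACT = 64, 8, 16, 2

def vision_encode(obs, W_v):
    """V: compress a high-dimensional observation to a latent z."""
    return np.tanh(W_v @ obs)

def memory_step(z, action, h, W_m):
    """M: recurrent model updating its hidden state from (z, a, h)."""
    x = np.concatenate([z, action, h])
    return np.tanh(W_m @ x)

def controller(z, h, W_c):
    """C: tiny linear policy acting on the concatenated [z, h]."""
    return np.tanh(W_c @ np.concatenate([z, h]))

# One step through the full V -> C -> M loop.
W_v = rng.standard_normal((LATENT, OBS)) * 0.1
W_m = rng.standard_normal((HIDDEN, LATENT + ACT + HIDDEN)) * 0.1
W_c = rng.standard_normal((ACT, LATENT + HIDDEN)) * 0.1

obs = rng.standard_normal(OBS)       # raw observation (a frame, in the paper)
h = np.zeros(HIDDEN)                 # M's hidden state
z = vision_encode(obs, W_v)          # V compresses the observation
a = controller(z, h, W_c)            # C picks an action from [z, h]
h = memory_step(z, a, h, W_m)        # M updates its belief about the world
```

The key structural point survives the simplification: the controller never sees raw observations, only the compact `[z, h]` summary, which is why it can stay tiny and be trained cheaply inside the dream.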
Further reading
- Ha & Schmidhuber (2018). World Models — the seminal paper
- Hafner et al. (2020). Dream to Control: Learning Behaviors by Latent Imagination — Dreamer, with actor-critic training in latent space
- Hafner et al. (2023). Mastering Diverse Domains through World Models — DreamerV3, the same architecture across Atari, DMControl, and Minecraft
- Alonso et al. (2024). Diffusion for World Modeling: Visual Details Matter in Atari — DIAMOND
- Hu et al. (2023). GAIA-1: A Generative World Model for Autonomous Driving

