Humans build an internal mental model of the world and routinely use simulation to update their expectations. When you catch a ball, you are not reacting to each frame of visual input — you are predicting the ball’s trajectory in a learned internal model and planning your arm’s motion against that prediction. World models bring this idea to machine learning: instead of learning a policy directly from environment interactions (model-free RL), the agent first learns a compressed, predictive model of the environment and then trains a controller inside that model. The agent literally learns to act inside its own dream.

What a world model is

At its most general, a world model is any learned function
(observation history, action) → next observation (or distribution over next observations)
that the agent can roll out forward in time without touching the real environment. Once you have such a function, you can:
  • Plan by searching over hypothetical action sequences
  • Train a policy entirely on imagined rollouts (no further real-environment samples needed once the model is trained)
  • Estimate uncertainty by sampling multiple futures from the model
This is the deep-learning incarnation of model-based reinforcement learning. Classical model-based RL either assumed a known transition function (as in MPC) or learned one in simple tabular or linear form (as in Dyna); modern world models learn the dynamics directly from pixels with deep networks.
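The rollout-and-plan loop above can be sketched in a few lines. This is a toy illustration, not any particular system's implementation: the linear dynamics `A`, `B` stand in for a learned model, the quadratic `reward` is invented for the example, and the planner is simple random shooting over hypothetical action sequences.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a *learned* transition model (illustrative only).
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # learned state-transition matrix
B = np.array([[0.0], [0.1]])             # learned effect of the action

def world_model(state, action):
    """One imagined step: predict the next state without touching the real env."""
    return A @ state + B @ action

def reward(state):
    """Imagined reward for the toy task: stay close to the origin."""
    return -np.sum(state ** 2)

def plan_random_shooting(state, horizon=10, n_candidates=256):
    """Plan by searching over hypothetical action sequences inside the model."""
    best_return, best_first_action = -np.inf, None
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=(horizon, 1))
        s, total = state.copy(), 0.0
        for a in actions:                # imagined rollout, no env interaction
            s = world_model(s, a)
            total += reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action             # execute only the first action (MPC-style)

a0 = plan_random_shooting(np.array([1.0, 0.0]))
```

Sampling several rollouts per action sequence (rather than one) would also give the uncertainty estimate mentioned above, if the model were stochastic.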

A family of architectures

World models have evolved through several generative-modeling backbones. Each generation kept the core idea — learn an internal simulator, train a controller inside it — but improved the fidelity and scope of the dream.
| Generation | Vision backbone | Example | Strength |
|---|---|---|---|
| Latent VAE + RNN | VAE | Ha & Schmidhuber (2018) | Clean decomposition, easy to train, low compute |
| Recurrent latent dynamics | VAE | Dreamer / DreamerV3 | Single architecture works across Atari, DMControl, Minecraft |
| Diffusion-based | Diffusion model | DIAMOND, Genie 2 | Photorealistic next-frame prediction; the dream looks like reality |
| Autoregressive token | Transformer | GAIA-1 | Multi-modal conditioning, supports text-conditioned scenario generation |
The original VAE-based architecture has blurry reconstructions and a fixed latent dimensionality, but it remains the cleanest place to build intuition before tackling the larger variants. Diffusion and autoregressive models achieve dramatically higher visual fidelity at the cost of much larger parameter counts and slower rollouts.
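To build that intuition, here is a minimal sketch of the VAE + RNN rollout structure: encode one observation into a latent, then step a recurrent dynamics model forward in latent space, decoding to pixels only when needed. The random weight matrices are placeholders for trained networks, and all dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, LATENT_DIM, ACT_DIM, HIDDEN = 64, 8, 2, 16

# Random weights standing in for *trained* networks (VAE encoder/decoder, RNN).
W_enc = rng.normal(0, 0.1, (LATENT_DIM, OBS_DIM))                  # encoder (mean only)
W_dec = rng.normal(0, 0.1, (OBS_DIM, LATENT_DIM))                  # decoder
W_h   = rng.normal(0, 0.1, (HIDDEN, HIDDEN + LATENT_DIM + ACT_DIM))  # recurrent cell
W_z   = rng.normal(0, 0.1, (LATENT_DIM, HIDDEN))                   # hidden -> next latent

def encode(obs):
    """Compress a high-dimensional observation into a small latent code."""
    return W_enc @ obs

def dream_rollout(obs0, actions):
    """Roll the latent dynamics forward; no decoding inside the loop."""
    z, h = encode(obs0), np.zeros(HIDDEN)
    latents = []
    for a in actions:
        h = np.tanh(W_h @ np.concatenate([h, z, a]))  # recurrent update
        z = W_z @ h                                   # predicted next latent
        latents.append(z)
    return np.stack(latents)

obs0 = rng.normal(size=OBS_DIM)
zs = dream_rollout(obs0, rng.uniform(-1.0, 1.0, size=(5, ACT_DIM)))
recon = W_dec @ zs[-1]   # decode a frame only when pixels are actually needed
```

The key design choice this illustrates is that the controller and planner live in the cheap latent space; the expensive decoder is only used for visualization or debugging.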

World models vs other approaches

It helps to put world models next to the alternatives you have already seen for learning a control policy. Each approach has a different failure mode:
| Approach | How it learns | Failure mode |
|---|---|---|
| Behavioral cloning | Supervised regression on expert demonstrations (where the expert is itself produced by an RL algorithm such as PPO, or by human teleoperation) | Distribution shift: errors compound on unseen states |
| World model | Learn a model of the environment, then train a controller in the dream | Model inaccuracy: the dream may diverge from reality |

How world models connect to the rest of Physical AI

World models sit at the intersection of several threads in this course.

Sim-to-real. A world model trained on real data is a simulator, one that is automatically calibrated to reality because it was learned from real observations. This sidesteps the hand-authored sim gap discussed in the sim-to-real transfer page: instead of building Gazebo worlds and hoping they match reality, you learn a world model from a few minutes of real video and train inside it.

Behavioral cloning vs world models. BC treats the expert's demonstrations as the only source of data. World models treat the environment's dynamics as learnable, so the agent can extrapolate beyond the demonstrations by imagining what would happen in unseen states. This is why world models partially address BC's distribution-shift problem: the agent has dreamed about a wider distribution of states than any fixed demonstration dataset provides.

VLA models. Most current VLA architectures, such as OpenVLA and RT-2, are pure behavior-cloning systems with no explicit world model. Combining them with learned dynamics models is an active research direction, precisely because doing so could close the BC-vs-world-model gap above.
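"Training a controller in the dream" can be made concrete with a toy black-box search over imagined returns. This sketch uses a simple finite-difference evolution strategy and invented linear dynamics as the learned model; Ha & Schmidhuber's original work used CMA-ES, but the structure is the same: every policy evaluation happens inside the model, with zero real-environment samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear dynamics standing in for a trained world model (illustrative only).
A = np.array([[1.0, 0.05], [0.0, 1.0]])
B = np.array([[0.0], [0.05]])

def imagined_return(policy_w, horizon=20):
    """Evaluate a linear controller entirely inside the model (the 'dream')."""
    s, total = np.array([1.0, 0.0]), 0.0
    for _ in range(horizon):
        a = np.clip(policy_w @ s, -1.0, 1.0)  # linear controller, bounded action
        s = A @ s + B @ a                     # imagined transition
        total += -np.sum(s ** 2)              # imagined reward: stay near origin
    return total

# Simple evolution strategy: perturb the policy, score each perturbation in
# the dream, and move toward perturbations that dreamed up higher returns.
SIGMA, LR, POP = 0.1, 0.01, 16
w = np.zeros((1, 2))
for _ in range(100):
    noise = rng.normal(0, SIGMA, size=(POP, 1, 2))
    returns = np.array([imagined_return(w + n) for n in noise])
    grad_est = np.tensordot(returns - returns.mean(), noise, axes=1) / (POP * SIGMA)
    w = w + LR * grad_est
```

Whether the resulting controller works in reality depends entirely on model accuracy, which is exactly the failure mode listed in the comparison table above.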

In this section

The Seminal Model

Ha & Schmidhuber’s original V/M/C architecture — VAE for vision, MDN-RNN for memory, tiny linear controller trained inside the dream.

Further reading