The original World Models paper by Ha & Schmidhuber (2018) is the seminal demonstration that an agent can learn to act inside its own learned simulator of the environment. It is the conceptual reference point for almost every modern world-model architecture, and the cleanest place to build intuition.

[Figure: World model agent driving in CarRacing — the agent navigates based on its learned internal model of the environment]

The core architecture

The agent is decomposed into three components:
  • V — Vision model (VAE). A variational autoencoder compresses each high-dimensional observation (e.g., a 64x64 RGB frame from CarRacing) into a compact latent vector z. This is the agent’s “visual perception” — it reduces a 12,288-dimensional pixel input to ~32 latent dimensions while preserving the information needed for control.
  • M — Memory model (MDN-RNN). A recurrent neural network with a mixture density output predicts the next latent state given the current latent state and action. This is the world model proper — it captures the environment’s dynamics in latent space. The mixture density network (MDN) output models uncertainty: the future is not deterministic, so the model predicts a distribution over possible next states.
  • C — Controller. A small linear controller maps the current latent state z and the RNN hidden state h to an action. Because V and M have already compressed the observation and learned the dynamics, the controller can be very simple — often just a single linear layer optimized with an evolution strategy (CMA-ES).
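The MDN head is the part that most often confuses readers: M does not emit one predicted z_next, it emits mixture weights plus per-component Gaussian parameters, and a sample is drawn from that mixture. A minimal sketch of the sampling step, assuming a diagonal-Gaussian mixture with 5 components over a 32-dimensional latent (the component count here is illustrative, not the paper's exact setting); the temperature parameter follows the paper's trick of scaling uncertainty up or down:

```python
import numpy as np

rng = np.random.default_rng(0)
K, Z_DIM = 5, 32  # assumed: 5 mixture components, 32-dim latent

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_mdn(logits_pi, mu, log_sigma, temperature=1.0):
    """Sample z_next from a diagonal-Gaussian mixture.
    logits_pi: (K,) mixture logits; mu, log_sigma: (K, Z_DIM)."""
    pi = softmax(logits_pi / temperature)   # mixture weights
    k = rng.choice(K, p=pi)                 # pick one component
    sigma = np.exp(log_sigma[k]) * np.sqrt(temperature)
    return mu[k] + sigma * rng.normal(size=Z_DIM)

# Fake network outputs standing in for one MDN-RNN forward pass:
z_next = sample_mdn(rng.normal(size=K),
                    rng.normal(size=(K, Z_DIM)),
                    0.1 * rng.normal(size=(K, Z_DIM)))
print(z_next.shape)  # (32,)
```

Raising the temperature makes the sampled futures noisier, which the paper uses to make dream environments harder than the real one and discourage the controller from exploiting model errors.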

Why this decomposition works

The architecture’s value comes from separation of concerns:
  • V handles dimensionality reduction — the controller never sees raw pixels
  • M handles temporal prediction — the controller doesn’t need to learn dynamics
  • C handles action selection — it operates in a low-dimensional, temporally structured space
This means the controller can be trained inside the world model without any environment interaction. Generate dream trajectories by rolling out M from a starting state, and optimize C against those dream trajectories. The agent learns to drive by dreaming about driving.
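The dream-training loop can be sketched end to end on a toy problem. Here a small linear-Gaussian system stands in for the trained M (the paper samples from an MDN; a single Gaussian stands in), and a simple elite-averaging evolution strategy stands in for CMA-ES — the structure (sample controllers, score them on dream rollouts only, update) is the point, not the specific numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the learned dynamics M: z' = A z + B a + noise.
# Reward favors driving the latent state toward the origin.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])

def dream_return(w, steps=50, noise=0.01, seed=1):
    """Roll out the dream model under the linear controller a = w @ z."""
    local = np.random.default_rng(seed)  # fixed seed: comparable scores
    z, total = np.array([1.0, 0.0]), 0.0
    for _ in range(steps):
        a = np.array([w @ z])
        z = A @ z + (B @ a).ravel() + noise * local.normal(size=2)
        total -= float(z @ z)            # reward: -||z||^2
    return total

# Simple evolution strategy (a stand-in for CMA-ES): sample controller
# weights around the current mean, keep the elite average, repeat.
mean, sigma = np.zeros(2), 0.5
for _ in range(30):
    pop = mean + sigma * rng.normal(size=(32, 2))
    scores = np.array([dream_return(w) for w in pop])
    mean = pop[np.argsort(scores)[-8:]].mean(axis=0)  # top 8 of 32

print(dream_return(mean), dream_return(np.zeros(2)))
```

Every evaluation above happens inside `dream_return` — the "environment" is never touched, which is exactly the property the decomposition buys you.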

The CarRacing experiment

The original paper demonstrates the approach on CarRacing — the same environment used in the behavioral cloning tutorial, so you can run all three approaches (BC, PPO, world model) on the same task and compare them directly. The training pipeline:
  1. Collect data. Run a random or partially trained policy in CarRacing to collect roughly 10,000 rollouts of (observation, action) pairs.
  2. Train V. Train the VAE on the collected frames to learn the latent representation.
  3. Train M. Train the MDN-RNN on sequences of (z, action, z_next) to learn the dynamics model.
  4. Train C. Using CMA-ES, evolve the controller parameters by evaluating candidate controllers on rollouts. In the paper’s CarRacing experiment this evaluation runs in the actual environment; the fully-in-dream variant — optimizing C inside M’s rollouts with no environment interaction — is demonstrated on VizDoom: Take Cover.
  5. Deploy. Run V + M + C in the real environment: observe → encode → predict → act.
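The deployment loop in step 5 can be sketched with toy weights standing in for the trained V, M, and C. The dimensions follow the paper's CarRacing setup (64x64x3 frames, 32-dim z, 256-dim hidden state, 3-dim action); `env_step` is a hypothetical stand-in for the real environment, and the plain tanh RNN step is a simplification of the MDN-RNN:

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, Z_DIM, H_DIM, ACT_DIM = 64 * 64 * 3, 32, 256, 3
W_enc = rng.normal(0, 0.01, (Z_DIM, OBS_DIM))               # stand-in for V
W_rnn = rng.normal(0, 0.01, (H_DIM, H_DIM + Z_DIM + ACT_DIM))  # stand-in for M
W_c = rng.normal(0, 0.01, (ACT_DIM, Z_DIM + H_DIM))         # stand-in for C

def env_step(action):
    """Hypothetical stand-in for the real environment's step():
    returns a random flattened 64x64 RGB frame."""
    return rng.random(OBS_DIM)

obs, h = rng.random(OBS_DIM), np.zeros(H_DIM)
for t in range(10):
    z = W_enc @ obs                                 # encode (V)
    a = np.tanh(W_c @ np.concatenate([z, h]))       # act (C)
    h = np.tanh(W_rnn @ np.concatenate([h, z, a]))  # update memory (M)
    obs = env_step(a)                               # observe next frame
print(z.shape, h.shape, a.shape)
```

Note how small C's job is: a single matrix multiply over the 288-dimensional concatenation of z and h, which is why an evolution strategy over a few thousand parameters suffices.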

Further reading