Imitation learning is the problem of learning a policy from expert demonstrations rather than from a reward signal. Instead of exploring an environment through trial and error (reinforcement learning), the agent observes an expert performing a task and learns to replicate that behavior. This is how most robot learning begins in practice: a human teleoperates a robot arm through a pick-and-place task, or drives a car around a track, and the recorded observation-action pairs become training data. The appeal is obvious — you skip the reward design problem entirely and learn directly from what “good behavior” looks like.

Why start here

Imitation learning is the simplest entry point into policy learning for physical systems. It requires no reward function, no environment model, and no multi-step planning. You collect data, train a supervised model, and deploy. This simplicity makes it the right place to build intuition before tackling world models and full VLA architectures later in the course. It is also where you will encounter the most important failure mode in robot learning: distribution shift. Understanding why imitation learning fails — and how to fix it — is essential context for everything that follows.

Behavioral cloning

The simplest form of imitation learning is behavioral cloning (BC): treat the expert’s demonstrations as a supervised learning dataset and train a neural network to map observations to actions. Given a dataset of (observation, action) pairs collected from an expert, minimize the prediction error:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(\pi_\theta(\mathbf{o}) - \mathbf{a}^*\right)^2\right]$$

where $\pi_\theta$ is the learned policy and $\mathbf{a}^*$ is the expert’s action. BC works when the learned policy stays close to the expert’s trajectory. The problem is what happens when it doesn’t.
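At its core, BC is ordinary regression. The sketch below shows the objective above in miniature, using a hypothetical linear expert and plain numpy gradient descent in place of the neural network and PyTorch setup used later in the tutorial; the controller `K_true` and all dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a linear controller a* = K_true @ o,
# standing in for recorded (observation, action) demonstrations.
K_true = np.array([[0.5, -0.2],
                   [0.1,  0.8]])
obs = rng.normal(size=(500, 2))   # observations from expert rollouts
acts = obs @ K_true.T             # expert actions a*

# Behavioral cloning: fit pi_theta(o) = K @ o by minimizing the
# mean squared error E[(pi_theta(o) - a*)^2] with gradient descent.
K = np.zeros((2, 2))
lr = 0.1
for _ in range(200):
    pred = obs @ K.T
    grad = 2 * (pred - acts).T @ obs / len(obs)   # dL/dK
    K -= lr * grad

# On noiseless, realizable data the fit recovers the expert exactly.
```

The same structure carries over directly when `K` becomes a deep network and the gradient step becomes an optimizer update.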

Distribution shift and compounding error

At training time, the policy sees observations from the expert’s trajectory. At test time, the policy’s own predictions determine the next observation. A small error at one step pushes the agent to a slightly different state — one the expert may never have visited. In that unfamiliar state the policy makes a worse prediction, drifting further from the expert’s trajectory, which leads to even more unfamiliar states. Expected total error grows quadratically with the time horizon (O(εT²) for a policy with per-step error rate ε, versus the O(εT) of i.i.d. supervised learning). This is covariate shift applied to sequential decision-making: the test-time input distribution (states visited by the learner) diverges from the training distribution (states visited by the expert).
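You can see the quadratic growth in a deliberately simple toy model (my construction, not from the tutorial): the policy errs with some small probability at each step, and once it falls off the expert’s trajectory it pays cost 1 for every remaining step. Doubling the horizon then roughly quadruples the expected cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_cost(eps, T, trials=100_000):
    """Monte Carlo estimate of total cost for a policy that errs with
    probability eps per step and, after its first mistake, stays off
    the expert's trajectory (cost 1 per remaining step).
    Theory predicts roughly eps * T**2 / 2."""
    mistakes = rng.random((trials, T)) < eps   # per-step error events
    first = mistakes.argmax(axis=1)            # index of first mistake
    never = ~mistakes.any(axis=1)              # rollouts with no mistake
    return np.where(never, 0, T - first).mean()

c_short = expected_cost(0.002, 50)
c_long = expected_cost(0.002, 100)
# c_long comes out close to 4x c_short: quadratic, not linear, in T.
```

The key assumption doing the work is that mistakes are absorbing: the learner has no training data for off-trajectory states, so it cannot recover — which is exactly what DAgger fixes.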

DAgger: learning from your own mistakes

DAgger (Dataset Aggregation) addresses distribution shift by iteratively collecting data from the learner’s trajectory, labeled by the expert. The algorithm:
  1. Train an initial BC policy on expert demonstrations
  2. Roll out the learner’s policy in the environment
  3. Ask the expert to label the states the learner actually visited (“What would you have done here?”)
  4. Add this new data to the training set
  5. Retrain and repeat
By training on states the learner visits rather than states the expert visits, the policy learns to handle its own mistakes. The training distribution converges toward the test-time distribution. A complete end-to-end implementation — collecting expert rollouts, training a BC policy, observing distribution shift, and then running DAgger rounds against the same expert — is in the behavioral cloning tutorial on CarRacing-v3. The dagger_round() function there is the algorithm above translated directly into PyTorch.
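The five steps above can be sketched in a toy 1-D setting. Everything here is an illustrative stand-in, not the tutorial’s CarRacing-v3 code: the expert is a hypothetical linear controller, the policy class is a single gain fit by least squares, and the dynamics are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def expert_action(s):
    """Hypothetical expert: steer the state back toward zero."""
    return -0.8 * s

def rollout(k, T=30, s0=1.0):
    """Roll out a linear policy a = k * s under toy dynamics
    s' = s + a + noise; return the states visited."""
    states, s = [], s0
    for _ in range(T):
        states.append(s)
        s = s + k * s + rng.normal(scale=0.05)
    return np.array(states)

def fit_policy(states, actions):
    """Step 1 / step 5: supervised least-squares fit of a = k * s."""
    return float(states @ actions / (states @ states))

# Step 1: initial BC policy on (noisy) expert demonstrations.
data_s = list(rollout(-0.8))
data_a = list(expert_action(np.array(data_s))
              + rng.normal(scale=0.05, size=len(data_s)))
k = fit_policy(np.array(data_s), np.array(data_a))

# Steps 2-5, repeated: roll out the *learner*, have the expert
# relabel the states the learner visited, aggregate, retrain.
for _ in range(5):
    visited = rollout(k)                       # learner's own states
    labels = expert_action(visited) + rng.normal(scale=0.05, size=len(visited))
    data_s += list(visited)
    data_a += list(labels)                     # expert labels, learner states
    k = fit_policy(np.array(data_s), np.array(data_a))
```

The crucial line is `visited = rollout(k)`: the dataset grows with states drawn from the learner’s own distribution, which is the change that makes the training and test distributions converge.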

In this section

Behavioral Cloning

Hands-on tutorial: train a BC policy on CarRacing-v3, observe compounding errors, and fix them with DAgger.

NVIDIA PilotNet (Legacy)

Historical case study: end-to-end driving with the NVIDIA PilotNet architecture and Udacity simulator (2016).