Imitation learning is the problem of learning a policy from expert demonstrations rather than from a reward signal. Instead of exploring an environment through trial and error (reinforcement learning), the agent observes an expert performing a task and learns to replicate that behavior. This is how most robot learning begins in practice: a human teleoperates a robot arm through a pick-and-place task, or drives a car around a track, and the recorded observation-action pairs become training data. The appeal is obvious — you skip the reward design problem entirely and learn directly from what “good behavior” looks like.

Why start here

Imitation learning is the simplest entry point into policy learning for physical systems. It requires no reward function, no environment model, and no multi-step planning. You collect data, train a supervised model, and deploy. This simplicity makes it the right place to build intuition before tackling world models and full VLA architectures later in the course. It is also where you will encounter the most important failure mode in robot learning: distribution shift. Understanding why imitation learning fails — and how to fix it — is essential context for everything that follows.

Behavioral cloning

The simplest form of imitation learning is behavioral cloning (BC): treat the expert’s demonstrations as a supervised learning dataset and train a neural network to map observations to actions. Given a dataset of (observation, action) pairs collected from an expert, minimize the prediction error:

$$\mathcal{L}(\theta) = \mathbb{E}\left[\left(\pi_\theta(\mathbf{o}) - \mathbf{a}^*\right)^2\right]$$

where $\pi_\theta$ is the learned policy and $\mathbf{a}^*$ is the expert’s action. BC works when the learned policy stays close to the expert’s trajectory. The problem is what happens when it doesn’t.
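At its core, BC is ordinary regression. The sketch below shows the objective above in miniature, using a hypothetical linear expert and plain numpy gradient descent in place of the neural network and PyTorch setup used later in the tutorial; the controller `K_true` and all dimensions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical expert: a linear controller a* = K_true @ o,
# standing in for recorded (observation, action) demonstrations.
K_true = np.array([[0.5, -0.2],
                   [0.1,  0.8]])
obs = rng.normal(size=(500, 2))   # observations from expert rollouts
acts = obs @ K_true.T             # expert actions a*

# Behavioral cloning: fit pi_theta(o) = K @ o by minimizing the
# mean squared error E[(pi_theta(o) - a*)^2] with gradient descent.
K = np.zeros((2, 2))
lr = 0.1
for _ in range(200):
    pred = obs @ K.T
    grad = 2 * (pred - acts).T @ obs / len(obs)   # dL/dK
    K -= lr * grad

# On noiseless, realizable data the fit recovers the expert exactly.
```

The same structure carries over directly when `K` becomes a deep network and the gradient step becomes an optimizer update.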

Distribution shift and compounding error

At training time, the policy sees observations from the expert’s trajectory. At test time, the policy’s own predictions determine the next observation. A small error at one step pushes the agent to a slightly different state — one the expert may never have visited. In that unfamiliar state the policy makes a worse prediction, drifting further from the expert’s trajectory, which leads to even more unfamiliar states. Expected total error grows quadratically with the time horizon (O(εT²) for a policy with per-step error rate ε, versus the O(εT) of i.i.d. supervised learning). This is covariate shift applied to sequential decision-making: the test-time input distribution (states visited by the learner) diverges from the training distribution (states visited by the expert).
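You can see the quadratic growth in a deliberately simple toy model (my construction, not from the tutorial): the policy errs with some small probability at each step, and once it falls off the expert’s trajectory it pays cost 1 for every remaining step. Doubling the horizon then roughly quadruples the expected cost.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_cost(eps, T, trials=100_000):
    """Monte Carlo estimate of total cost for a policy that errs with
    probability eps per step and, after its first mistake, stays off
    the expert's trajectory (cost 1 per remaining step).
    Theory predicts roughly eps * T**2 / 2."""
    mistakes = rng.random((trials, T)) < eps   # per-step error events
    first = mistakes.argmax(axis=1)            # index of first mistake
    never = ~mistakes.any(axis=1)              # rollouts with no mistake
    return np.where(never, 0, T - first).mean()

c_short = expected_cost(0.002, 50)
c_long = expected_cost(0.002, 100)
# c_long comes out close to 4x c_short: quadratic, not linear, in T.
```

The key assumption doing the work is that mistakes are absorbing: the learner has no training data for off-trajectory states, so it cannot recover — which is exactly what DAgger fixes.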

DAgger: learning from your own mistakes

DAgger (Dataset Aggregation) addresses distribution shift by iteratively collecting data from the learner’s trajectory, labeled by the expert. The algorithm:
  1. Train an initial BC policy on expert demonstrations
  2. Roll out the learner’s policy in the environment
  3. Ask the expert to label the states the learner actually visited (“What would you have done here?”)
  4. Add this new data to the training set
  5. Retrain and repeat
By training on states the learner visits rather than states the expert visits, the policy learns to handle its own mistakes. The training distribution converges toward the test-time distribution. A complete end-to-end implementation — collecting expert rollouts, training a BC policy, observing distribution shift, and then running DAgger rounds against the same expert — is in the behavioral cloning tutorial on CarRacing-v3. The dagger_round() function there is the algorithm above translated directly into PyTorch.
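The five steps above can be sketched in a toy 1-D setting. Everything here is an illustrative stand-in, not the tutorial’s CarRacing-v3 code: the expert is a hypothetical linear controller, the policy class is a single gain fit by least squares, and the dynamics are made up.

```python
import numpy as np

rng = np.random.default_rng(2)

def expert_action(s):
    """Hypothetical expert: steer the state back toward zero."""
    return -0.8 * s

def rollout(k, T=30, s0=1.0):
    """Roll out a linear policy a = k * s under toy dynamics
    s' = s + a + noise; return the states visited."""
    states, s = [], s0
    for _ in range(T):
        states.append(s)
        s = s + k * s + rng.normal(scale=0.05)
    return np.array(states)

def fit_policy(states, actions):
    """Step 1 / step 5: supervised least-squares fit of a = k * s."""
    return float(states @ actions / (states @ states))

# Step 1: initial BC policy on (noisy) expert demonstrations.
data_s = list(rollout(-0.8))
data_a = list(expert_action(np.array(data_s))
              + rng.normal(scale=0.05, size=len(data_s)))
k = fit_policy(np.array(data_s), np.array(data_a))

# Steps 2-5, repeated: roll out the *learner*, have the expert
# relabel the states the learner visited, aggregate, retrain.
for _ in range(5):
    visited = rollout(k)                       # learner's own states
    labels = expert_action(visited) + rng.normal(scale=0.05, size=len(visited))
    data_s += list(visited)
    data_a += list(labels)                     # expert labels, learner states
    k = fit_policy(np.array(data_s), np.array(data_a))
```

The crucial line is `visited = rollout(k)`: the dataset grows with states drawn from the learner’s own distribution, which is the change that makes the training and test distributions converge.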

In this section

Behavioral Cloning

Hands-on tutorial: train a BC policy on CarRacing-v3, observe compounding errors, and fix them with DAgger.

NVIDIA PilotNet (Legacy)

Historical case study: end-to-end driving with the NVIDIA PilotNet architecture and Udacity simulator (2016).