Physical AI - aegean.ai

A tabletop manipulation setup: a robotic arm on a wooden board, with an overhead stereo camera head and a tripod-mounted observer webcam, tasked with picking and placing colored geometric objects.

Physical AI is where the AI and robotics tracks converge: systems that perceive the world, reason about it using language and vision, and act in physical environments. This section traces the progression from simple learning from demonstrations to full embodied agents. You will work through four stages, each motivated by the limitations of the previous one:

Imitation learning: you start by training a policy to copy an expert’s behavior. This works surprisingly well on easy cases, but it fails when the agent drifts into states the expert never demonstrated. Understanding why it fails (distribution shift) motivates everything that follows.
World models: to address distribution shift, you learn an internal model of the environment and train a controller inside that model. This is the deep-learning incarnation of model-based reinforcement learning: instead of relying on a hand-specified or low-dimensional transition model, as classical methods such as MPC and Dyna typically did, the agent learns the dynamics directly from pixel observations using a VAE plus an RNN (or, in modern variants, a diffusion model or an autoregressive transformer). The agent then “dreams” about driving, rolling out trajectories inside the learned model before it ever touches the real environment, which exposes it to a far wider distribution of states than any fixed demonstration dataset provides.
VLA models: you combine vision, language, and action (VLA) into a single end-to-end architecture. Most current VLAs are trained as pure behavior-cloning (BC) systems on large robot demonstration datasets. OpenVLA, for example, is fine-tuned from a pretrained vision-language model on the Open X-Embodiment dataset using next-action prediction; there is no explicit world model or dynamics rollout in the architecture. VLAs are powerful, but they inherit the same distribution-shift fragility you saw with simple BC, which is why current research explores combining them with world models and RL fine-tuning. This is the current frontier of embodied AI.
Sim-to-real transfer: finally, you deploy your learned policy to real hardware. You will study the techniques that close the gap between simulation and reality: domain randomization, system identification, and 3D Gaussian splatting for photorealistic world generation.

Imitation Learning

The simplest way to learn: copy the expert. Train a driving policy from demonstrations, observe distribution shift, and fix it with DAgger.
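
The DAgger loop can be sketched on a toy one-dimensional lane-keeping problem. This is an illustration under stated assumptions, not the course's driving setup: the expert, the dynamics, and the linear policy below are all hypothetical stand-ins. The part to notice is the structure: after an initial round of behavior cloning, the learner drives, the expert only labels the states the learner actually visits, and every label is added to one growing dataset.

```python
import random


def expert_action(x):
    """Hypothetical expert: steer back toward the lane centre (x = 0)."""
    return -0.5 * x


def step(x, action, rng):
    """Toy dynamics (an assumption): offset changes by the action plus noise."""
    return x + action + rng.gauss(0.0, 0.05)


def fit_linear_policy(dataset):
    """Least-squares fit of action = w * x (no bias term)."""
    sxx = sum(x * x for x, _ in dataset)
    sxa = sum(x * a for x, a in dataset)
    w = sxa / sxx if sxx > 0 else 0.0
    return (lambda x: w * x), w


def rollout(policy, x0, horizon, rng):
    """Run a policy in the toy environment and return the states it visited."""
    states, x = [], x0
    for _ in range(horizon):
        states.append(x)
        x = step(x, policy(x), rng)
    return states


def dagger(iterations=5, horizon=20, seed=0):
    rng = random.Random(seed)
    # Round 0 is plain behavior cloning: the expert drives and labels.
    states = rollout(expert_action, x0=1.0, horizon=horizon, rng=rng)
    dataset = [(x, expert_action(x)) for x in states]
    policy, w = fit_linear_policy(dataset)
    for _ in range(iterations):
        # The learner drives, so the visited states come from its own distribution.
        visited = rollout(policy, x0=rng.uniform(-2.0, 2.0), horizon=horizon, rng=rng)
        # The expert only relabels those states; the data is aggregated, not replaced.
        dataset += [(x, expert_action(x)) for x in visited]
        policy, w = fit_linear_policy(dataset)
    return w, len(dataset)


if __name__ == "__main__":
    w, n = dagger()
    print(f"learned gain w = {w:.3f} from {n} labelled states")
```

Because both the expert and the learner are linear here, the toy problem is easy and shows no dramatic failure of plain cloning. The original algorithm also allows mixing expert and learner control during data collection; this sketch lets the learner drive alone after round 0.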

World Models

Learn an internal simulator of the environment and train a controller inside the dream. Addresses distribution shift by imagining unseen states.
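
The "train inside the dream" idea can be sketched with a deliberately tiny stand-in. Instead of a VAE plus an RNN over pixels, the sketch below fits a linear model `x' ≈ a*x + b*u` to logged transitions by least squares, then scores candidate controllers using only imagined rollouts in that fitted model. The environment, the cost function, and the candidate gains are all assumptions chosen for illustration.

```python
import random


def true_step(x, u):
    """Real environment, used only to log data (assumed: x' = 0.9 x + 0.5 u)."""
    return 0.9 * x + 0.5 * u


def collect(n, rng):
    """Log n transitions (x, u, x') under random states and actions."""
    data = []
    for _ in range(n):
        x, u = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        data.append((x, u, true_step(x, u)))
    return data


def fit_model(data):
    """Least squares for x' ~ a*x + b*u via the 2x2 normal equations."""
    sxx = sum(x * x for x, u, y in data)
    sxu = sum(x * u for x, u, y in data)
    suu = sum(u * u for x, u, y in data)
    sxy = sum(x * y for x, u, y in data)
    suy = sum(u * y for x, u, y in data)
    det = sxx * suu - sxu * sxu
    a = (sxy * suu - sxu * suy) / det
    b = (sxx * suy - sxu * sxy) / det
    return a, b


def dream_cost(model, gain, x0=1.0, horizon=15):
    """Imagined rollout: the real environment is never called here."""
    a, b = model
    x, cost = x0, 0.0
    for _ in range(horizon):
        u = -gain * x                 # simple proportional controller
        x = a * x + b * u             # step the learned model, not the world
        cost += x * x + 0.1 * u * u   # penalize offset and control effort
    return cost


def train_in_dream(model, candidate_gains):
    """Pick the controller gain with the lowest imagined cost."""
    return min(candidate_gains, key=lambda g: dream_cost(model, g))


if __name__ == "__main__":
    rng = random.Random(0)
    model = fit_model(collect(200, rng))
    best = train_in_dream(model, [0.0, 0.5, 1.0, 1.5, 2.0])
    print("learned (a, b):", model, "chosen gain:", best)
```

The design point carries over to larger models: once the dynamics are learned, controller search costs no real-world interaction, but the controller is only as good as the model's accuracy in the states it imagines.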

VLA Models

Combine vision, language, and action into end-to-end agents. Most are trained by imitation learning at scale; pairing them with world models and RL fine-tuning is an active research direction.
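
One common way to set up next-action prediction on a language-model backbone is to discretize each continuous action dimension into a fixed number of bins, so that an action becomes a short sequence of tokens. The sketch below shows that encoding and its inverse. The bin count, the action range, and the function names are assumptions for illustration; real systems pick their own ranges, often from dataset statistics.

```python
def encode_action(action, low=-1.0, high=1.0, n_bins=256):
    """Map each continuous action dimension to an integer bin index (a token)."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)                     # clip to the assumed range
        idx = int((a - low) / (high - low) * n_bins)   # uniform binning
        tokens.append(min(idx, n_bins - 1))            # a == high falls in the last bin
    return tokens


def decode_tokens(tokens, low=-1.0, high=1.0, n_bins=256):
    """Map bin indices back to continuous values at the bin centres."""
    width = (high - low) / n_bins
    return [low + (t + 0.5) * width for t in tokens]


if __name__ == "__main__":
    # Hypothetical 4-dimensional action, e.g. three end-effector deltas and a gripper.
    action = [0.0, -1.0, 1.0, 0.33]
    tokens = encode_action(action)
    print(tokens, decode_tokens(tokens))
```

The round trip loses at most half a bin width per dimension, which is the price of treating control as token prediction.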

Sim-to-Real Transfer

The final mile: deploy learned policies to real hardware. Domain randomization, 3D Gaussian splatting, and closing the sim-to-real gap.
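
Domain randomization can be sketched as sampling a fresh set of simulator parameters for every training episode, so the policy never sees the same physics or appearance twice. The parameter names and ranges below are hypothetical and not tied to any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass
class SimParams:
    friction: float
    object_mass_kg: float
    light_intensity: float
    camera_jitter_deg: float


# Hypothetical ranges; in practice these are tuned to bracket the real system.
RANGES = {
    "friction": (0.4, 1.2),
    "object_mass_kg": (0.05, 0.5),
    "light_intensity": (0.5, 1.5),
    "camera_jitter_deg": (-2.0, 2.0),
}


def sample_params(rng):
    """Draw one simulator configuration uniformly from the ranges."""
    return SimParams(**{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()})


def randomized_episodes(n, seed=0):
    """One independent configuration per training episode."""
    rng = random.Random(seed)
    return [sample_params(rng) for _ in range(n)]


if __name__ == "__main__":
    for params in randomized_episodes(3):
        print(params)
```

Widening a range makes the policy more robust to that factor but also makes the training task harder; system identification is the complementary move of narrowing the ranges around measured values.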

References

LeRobot — Hugging Face’s IL library for real robots; bundles Diffusion Policy, ACT, VQ-BeT, π0
Diffusion Policy — Chi et al., Toyota Research; visuomotor policy learning via action diffusion. arxiv.org/abs/2303.04137
ACT / ALOHA — Zhao et al.; action-chunking transformer for bimanual manipulation
RoboMimic — Stanford; offline IL benchmark suite for manipulation
OpenVLA — open vision-language-action model fine-tuned on the Open X-Embodiment dataset
