The domain gap
A policy trained in Gazebo sees flat textures, perfect lighting, and noiseless depth. A real kitchen has specular reflections, clutter, and a depth camera that hallucinates on glass surfaces. The gap shows up as:
- Visual mismatch — synthetic renders look nothing like real camera images
- Dynamics mismatch — simulated friction, mass, and contact differ from the real robot
- Sensor mismatch — perfect simulated sensors vs noisy real IMUs, cameras, and lidars
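The sensor mismatch can be made concrete. The sketch below is an illustrative noise model (not any particular camera's specification, and the parameter values are assumptions): it corrupts a clean simulated depth map with multiplicative Gaussian noise and random pixel dropout, mimicking the missing returns a real depth camera produces on glass.

```python
import numpy as np

def corrupt_depth(depth, dropout_p=0.02, noise_std=0.01, seed=0):
    """Apply a simple noise model to a clean simulated depth map:
    multiplicative Gaussian noise plus random pixel dropout, returned
    as 0.0, a common 'no reading' value on glass or specular surfaces."""
    rng = np.random.default_rng(seed)
    noisy = depth * (1.0 + rng.normal(0.0, noise_std, size=depth.shape))
    mask = rng.random(depth.shape) < dropout_p
    noisy[mask] = 0.0  # dropped returns, as on glass
    return noisy

clean = np.full((4, 4), 2.0)  # 4x4 depth map, 2 m everywhere
noisy = corrupt_depth(clean)
```

Training on depth corrupted this way is itself a small dose of domain randomization, which the next section generalizes.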
Domain randomization
The brute-force approach: randomize simulation parameters (textures, lighting, object positions, camera noise, friction coefficients) so broadly that the real world falls within the training distribution. The policy learns to be invariant to visual and physical variation rather than memorizing one simulated environment. Domain randomization is simple to implement but expensive — you need enough variation to cover reality, and you cannot know in advance whether you have enough. It also tends to produce conservative policies that handle variation by being cautious rather than precise.
Domain adaptation
Instead of randomizing the source domain, align the feature distributions of simulated and real data so they become indistinguishable to the policy. Techniques include:
- Adversarial adaptation — train a discriminator to distinguish sim from real features, and a feature extractor that fools it
- Style transfer — render simulated images through a neural style transfer network trained on real images
- Feature matching — minimize distributional distance (MMD, CORAL) between sim and real feature spaces
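As a concrete instance of feature matching, CORAL aligns the second-order statistics (feature covariances) of the two domains. A minimal numpy sketch, where random vectors stand in for features from a network's penultimate layer:

```python
import numpy as np

def coral_loss(source_feats, target_feats):
    """CORAL distance: squared Frobenius norm between the feature
    covariance matrices of the two domains, normalized by 4*d^2.
    Minimized as an auxiliary loss on a shared feature extractor,
    it pulls sim and real feature statistics together."""
    d = source_feats.shape[1]
    cs = np.cov(source_feats, rowvar=False)  # (d, d) source covariance
    ct = np.cov(target_feats, rowvar=False)  # (d, d) target covariance
    return float(np.sum((cs - ct) ** 2)) / (4.0 * d * d)

rng = np.random.default_rng(0)
sim_feats = rng.normal(0.0, 1.0, size=(256, 8))   # stand-in sim features
real_feats = rng.normal(0.0, 1.5, size=(256, 8))  # real features, different scale
same = coral_loss(sim_feats, sim_feats)   # identical covariances: zero
gap = coral_loss(sim_feats, real_feats)   # mismatched scales: positive
```

In a full adaptation setup this loss would be added to the task loss and backpropagated through the feature extractor; the numpy version only shows the statistic being matched.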
System identification
Calibrate the simulator’s physics parameters to match the real robot. Measure real-world friction, damping, motor response curves, and sensor noise profiles, then set the simulator to match. This closes the dynamics gap directly rather than learning around it. System identification is most effective for dynamics-dominated tasks (locomotion, contact-rich manipulation) where visual appearance matters less than physical accuracy.
3D Gaussian splatting for photorealistic world generation
The techniques above all accept a hand-authored simulation as the starting point and try to compensate for its visual poverty. A different approach: start from a 3D scan of the real environment and train in a photorealistic reconstruction.

3D Gaussian Splatting (3DGS) reconstructs a dense, renderable 3D scene from a set of posed images. Each Gaussian carries position, shape, color, and opacity. Rendering is done by splatting onto an image plane — no ray marching — enabling real-time (100+ FPS) novel view synthesis. This changes the sim-to-real pipeline in three specific ways:
- Photorealistic world generation — instead of hand-authoring simulator worlds with approximate textures and lighting, you scan the target environment (warehouse, kitchen, hospital corridor) with a camera, train a Gaussian splat, and render training views from it. The policy trains on images that look like reality because they are derived from reality, re-rendered from novel viewpoints. This drastically reduces the visual domain gap.
- Infinite viewpoint augmentation — a single scan produces a continuous 3D field renderable from any pose. The robot can practice navigating the real space from viewpoints it has never physically visited. Unlike image augmentation (crop, color jitter), this produces geometrically consistent novel views with correct occlusion and parallax.
- Semantic simulation environments — with language-embedded splats (LEGaussians, LangSplat), the reconstructed world becomes queryable: “where is the couch?” returns a 3D location. A VLA agent can train navigation and manipulation in a photorealistic, semantically labeled environment derived from a real scan, without manual annotation.

The progression from classical mapping to 3DGS-based simulation:

| Representation | What it captures | Sim-to-real utility |
|---|---|---|
| Occupancy grid (SLAM) | Free/occupied cells | Collision avoidance only |
| Sparse point cloud (ORB-SLAM) | 3D keypoints | Re-localization landmarks |
| 3D Gaussian Splat | Dense geometry + appearance | Photorealistic training views |
| Language-embedded splat | Geometry + appearance + semantics | Queryable training environment |
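The splatting step itself can be illustrated with a toy example. The sketch below composites isotropic 2D Gaussians front to back with alpha blending, the core of 3DGS rendering, leaving out the 3D-to-2D projection, anisotropic covariances, spherical-harmonic color, and depth-sorted tiling of the real pipeline.

```python
import numpy as np

def splat(gaussians, width=32, height=32):
    """Toy front-to-back alpha compositing of isotropic 2D Gaussians.
    Each Gaussian: (cx, cy, sigma, color(3,), opacity), assumed already
    sorted near-to-far. Nearer splats occlude farther ones via the
    accumulated transmittance."""
    ys, xs = np.mgrid[0:height, 0:width]
    img = np.zeros((height, width, 3))
    transmittance = np.ones((height, width))  # light not yet absorbed
    for cx, cy, sigma, color, opacity in gaussians:
        falloff = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        alpha = opacity * falloff
        img += (transmittance * alpha)[..., None] * np.asarray(color)
        transmittance *= 1.0 - alpha
    return img

scene = [(10, 10, 3.0, (1, 0, 0), 0.8),  # near red blob
         (16, 16, 5.0, (0, 0, 1), 0.9)]  # far blue blob
image = splat(scene)
```

Because compositing is a pure rasterization pass (no ray marching), the real implementation reaches the 100+ FPS figures quoted above on a GPU.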
Combining approaches
In practice, these techniques are complementary:
- Use 3DGS to close the visual gap at the source
- Apply domain randomization on top for factors the scan does not capture (lighting changes, object rearrangement, sensor noise)
- Use system identification for dynamics-critical tasks
- Apply domain adaptation as a final fine-tuning step with a small amount of real-world data
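One way to see how these pieces compose is as a single training configuration. The dataclass below is a hypothetical sketch: every field name, path, and value is illustrative rather than taken from any framework or real calibration.

```python
from dataclasses import dataclass, field

@dataclass
class SimToRealConfig:
    """Illustrative composition of the four techniques in one pipeline."""
    scan_path: str = "scenes/kitchen.splat"  # 3DGS scene for photoreal rendering
    randomize: dict = field(default_factory=lambda: {
        # randomization on top of the scan, for factors it does not capture
        "light_intensity": (0.5, 2.0),
        "object_jitter_m": (0.0, 0.10),
        "depth_noise_std": (0.0, 0.02),
    })
    sysid_params: dict = field(default_factory=lambda: {
        # dynamics values as measured on the real robot (illustrative numbers)
        "joint_friction": 0.12,
        "motor_delay_ms": 8.0,
    })
    adapt_real_frames: int = 500  # small real dataset for final adaptation

cfg = SimToRealConfig()
```

The ordering mirrors the list above: the scan closes the visual gap, randomization covers what the scan misses, system identification fixes dynamics, and a small real dataset drives the final adaptation step.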

