3D Gaussian Splatting for Robot Navigation
Introduction
Traditional robotics mapping produces sparse point clouds (ORB-SLAM) or 2D occupancy grids (SLAM Toolbox). Neither captures visual appearance. A robot with an occupancy grid of a kitchen knows where the walls are but cannot answer “what does the kitchen look like from the doorway?” or “is this the same kitchen I saw before?”

3D Gaussian Splatting (3DGS) produces a dense, renderable 3D representation by fitting a collection of 3D Gaussians to a set of posed images. Each Gaussian has a position, covariance (shape), color (via spherical harmonics), and opacity. Rendering is done by splatting these Gaussians onto an image plane — no ray marching needed — enabling real-time (100+ FPS) novel view synthesis. For robotics, this enables:

- Visual re-localization — render what the robot expects to see from a candidate pose, compare with what it actually sees
- Scene understanding — with language-embedded splats, the 3D map becomes queryable: “where is the couch?” returns a rendered view and a pose
- Navigation planning — dense 3D geometry from splats generates richer obstacle representations than sparse point clouds
| Part | Focus | Environment |
|---|---|---|
| Part 1 | Capture and train your first Gaussian Splat | Your own room (any camera) |
| Part 2 | Integrate GS capture into the ROS pipeline | Gazebo simulation (TurtleBot + D435i) |
| Part 3 | Build a semantic navigation agent on top of a language-embedded splat | Simulation + LLM |
Prerequisites
- Completed Camera Calibration assignment
- A camera for scene capture — any of the following works:
  - iPhone 12 Pro or newer (LiDAR) with Record3D — best quality, no COLMAP needed
  - Any iPhone/Android with Polycam or Scaniverse
  - Laptop webcam or USB camera — use COLMAP for pose estimation (see below)
- Workstation with NVIDIA GPU (RTX 3060+ for preview quality, RTX 3090/4090 for full quality)
- Docker with NVIDIA Container Toolkit installed
COLMAP Installation
COLMAP is a Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline that estimates camera poses from unposed images. It is required if you are using a monocular camera (laptop webcam, USB camera, or phone without LiDAR).

Option A — Inside the Nerfstudio Docker container (recommended): COLMAP is pre-installed in the Nerfstudio Docker image. No additional setup needed. Run `ns-process-data images` inside the container and it will call COLMAP automatically.
Option B — Native installation on Ubuntu: COLMAP is available from the Ubuntu package repositories (`sudo apt install colmap`); a CUDA-enabled build from source gives faster feature matching.

Either way, the `ns-process-data images` command wraps COLMAP and handles the full SfM pipeline for you — you do not need to call COLMAP directly.
Background Reading
Before starting, study these resources in order:

- Conceptual overview: Introduction to 3D Gaussian Splatting (Hugging Face blog) — accessible explanation of the theory
- Original paper: 3D Gaussian Splatting for Real-Time Radiance Field Rendering (SIGGRAPH 2023) — the foundational work
- Robotics survey: 3DGS in Robotics: A Survey — how 3DGS is being used for SLAM, navigation, and manipulation
- Nerfstudio documentation: Splatfacto method — the training framework you will use
Compute Requirements
| Task | VRAM | GPU | Time |
|---|---|---|---|
| Training splatfacto, 7K iterations (preview) | 12-16 GB | RTX 3060+ | ~10 min |
| Training splatfacto, 30K iterations (full) | 24 GB | RTX 3090/4090 | 30-40 min |
| Training with gsplat (optimized) | 6 GB (4x less) | RTX 3090/4090 | 15% faster |
| Rendering trained splat | — | Any CUDA GPU | 100-200+ FPS |
| COLMAP SfM (200 images) | 2-4 GB | CPU or CUDA GPU | 5-30 min |
Part 1: Capture and Train Your First Gaussian Splat
In this part you will capture a real-world scene (your room, living space, or lab) using any available camera and train a Gaussian Splat from the captured data. The goal is to build intuition for how capture quality affects reconstruction quality before moving to the robotic pipeline.

Task 1.1: Scene Capture
Choose a room-scale scene (3-6 meters across). Good scenes have varied geometry and texture — avoid blank white walls. Follow these capture best practices regardless of camera type:

- Walk slowly — fast motion causes blur and tracking loss
- Maintain 70-80% overlap between consecutive frames
- Capture from multiple heights (standing, crouching) to cover vertical surfaces
- Cover the full scene — gaps in coverage become holes in reconstruction
- Aim for 200-500 frames for a single room
Path A: iPhone with LiDAR (best quality)
If you have an iPhone 12 Pro or newer:

- Install Record3D from the App Store.
- Record a walkthrough of the scene.
- Export as “EXR + JPG sequence”.
- Also export “Zipped PLY point clouds” — these provide a LiDAR-based point cloud that dramatically improves Gaussian initialization.
- Transfer both exports to your workstation.
Path B: Any phone with Polycam/Scaniverse
If you have any iPhone or Android phone:

- Install Polycam or Scaniverse.
- Use the app’s photo mode to capture the scene (follow the overlap guidelines above).
- Export the capture and transfer to your workstation.
Path C: Laptop webcam or USB camera (COLMAP)
If you only have a laptop webcam or USB camera:

- Capture images: Record a video of your room and extract frames from it, or take individual photos while walking around the scene. Alternatively, write a short Python script using OpenCV to capture frames from the webcam at regular intervals as you walk around the room.
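A minimal frame-extraction sketch along those lines, assuming OpenCV (`opencv-python`) is installed. The target rate, file names, and helper names are illustrative:

```python
def frame_indices(total_frames, video_fps, target_fps):
    """Indices of frames to keep when downsampling video_fps to target_fps."""
    step = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, step))

def extract_frames(video_path, out_dir, target_fps=2.0):
    """Extract frames from a recorded video at roughly target_fps.

    Aim for 200-500 frames total for a single room (see the capture
    guidelines above): e.g., a 3-minute walkthrough at 2 FPS gives ~360.
    """
    import os
    import cv2  # imported here so frame_indices works without OpenCV installed

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    keep = set(frame_indices(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)), fps, target_fps))
    i = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i in keep:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        i += 1
    cap.release()
    return saved
```

For example, a 30 FPS video downsampled to 3 FPS keeps every 10th frame.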
- Use your calibration: If you completed the Camera Calibration assignment, you can provide your camera intrinsics to improve COLMAP’s accuracy. Otherwise, COLMAP will estimate intrinsics automatically (less accurate, especially for wide-angle laptop cameras with significant distortion).
- Process with COLMAP (inside the Nerfstudio container — COLMAP is pre-installed): run `ns-process-data images --data <image-dir> --output-dir <processed-dir>`. This runs the full COLMAP SfM pipeline: feature extraction (SIFT), feature matching, sparse reconstruction, and undistortion. Expect 5-30 minutes depending on image count and whether GPU-accelerated matching is available.
- Troubleshooting COLMAP failures: If COLMAP fails to reconstruct (common causes: insufficient overlap, textureless surfaces, motion blur):
  - Check that at least 70% of image pairs have matched features — `ns-process-data` will report the number of registered images
  - Re-capture with slower movement and more overlap
  - Try `--feature-type superpoint --matcher-type superglue` for more robust matching (requires the SuperPoint/SuperGlue models)
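For reference, the SfM stages that `ns-process-data` drives can also be invoked as explicit COLMAP CLI calls. The sketch below just builds those command lines (flag names per COLMAP's command-line interface); you normally do not need this, since `ns-process-data` handles it:

```python
def colmap_commands(image_dir, workspace):
    """Build the three core COLMAP SfM stages as subprocess command lists:
    feature extraction -> exhaustive matching -> sparse mapping."""
    db = f"{workspace}/database.db"
    return [
        ["colmap", "feature_extractor", "--database_path", db,
         "--image_path", image_dir],
        ["colmap", "exhaustive_matcher", "--database_path", db],
        ["colmap", "mapper", "--database_path", db,
         "--image_path", image_dir, "--output_path", f"{workspace}/sparse"],
    ]

def run_colmap(image_dir, workspace):
    """Run the stages in order; raises if any SfM stage fails."""
    import subprocess
    for cmd in colmap_commands(image_dir, workspace):
        subprocess.run(cmd, check=True)
```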
| Path | Poses from | Depth available | Point cloud init | COLMAP needed | Quality |
|---|---|---|---|---|---|
| A: Record3D (LiDAR) | ARKit | Yes (LiDAR) | Yes (PLY export) | No | Best |
| B: Polycam/Scaniverse | App SfM or ARKit | Sometimes | Sometimes | Sometimes | Good |
| C: Webcam + COLMAP | COLMAP SfM | No | SfM sparse points | Yes | Adequate |
Task 1.2: Nerfstudio Setup
You need a working Nerfstudio environment with GPU support. Choose one of the following options:

Option A: Official Nerfstudio Docker image (recommended)
The Nerfstudio project publishes an official Docker image on GitHub Container Registry that includes Nerfstudio, COLMAP, PyTorch, and CUDA pre-configured. Pull a pinned version tag (e.g., `ghcr.io/nerfstudio-project/nerfstudio:1.1.3`). See the GHCR package page for all available tags.
If you prefer to build the image yourself (e.g., to target a specific CUDA architecture), build from the Dockerfile in the Nerfstudio repository.
Option B: Native pip install
If you prefer a local installation without Docker, follow the Nerfstudio installation guide: install a matching PyTorch/CUDA build, then `pip install nerfstudio`.

Task 1.3: Train Your First Splat
Inside your Nerfstudio environment (Docker or native), train splatfacto on your processed data with `ns-train splatfacto --data <processed-dir>`.

Task 1.4: Evaluate and Experiment
Once your first splat is trained, run evaluation and begin experimenting:

- Evaluate your trained splat using Nerfstudio’s built-in metrics (`ns-eval --load-config <path-to-config.yml>`). Report PSNR, SSIM, and LPIPS on the held-out test views.
- Export the trained splat for external viewing (e.g., `ns-export gaussian-splat` produces a .ply you can open in standard splat viewers).
- Experiment with at least two of the following variations and compare results:
| Experiment | What to vary | What to measure | Notes |
|---|---|---|---|
| Training duration | 7K vs 15K vs 30K iterations | PSNR/SSIM, training time, visual quality | All paths |
| Method variant | splatfacto vs splatfacto-big | Quality metrics, VRAM usage, training time | All paths |
| Depth supervision | With vs without depth_file_path in transforms.json | Geometric accuracy, floaters | Path A/B only (requires depth) |
| Point cloud init | With vs without --ply during processing | Convergence speed, final quality | Path A only (requires LiDAR PLY) |
| Quality tuning | Default vs tuned (--pipeline.model.cull_alpha_thresh=0.005 --pipeline.model.continue_cull_post_densification=False --pipeline.model.use_scale_regularization=True) | Floaters, edge quality | All paths |
| Capture density | Vary frame extraction rate (e.g., 1 FPS vs 3 FPS vs 5 FPS) | Quality vs training time | Path C (video extraction) |
- Capture a second scene — a different room or area — and train a splat. You will use both scenes to discuss coverage vs. quality tradeoffs in your report.
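PSNR, reported throughout these experiments, is simple to compute directly if you want to sanity-check the numbers `ns-eval` produces. A minimal numpy sketch (SSIM and LPIPS need additional packages, e.g. scikit-image and lpips, and are omitted here):

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

For example, two images differing uniformly by 0.1 give an MSE of 0.01 and a PSNR of 20 dB; well-trained room-scale splats typically land in the mid-20s to low-30s.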
Deliverables for Part 1
- Screenshots or screen recordings of the Nerfstudio viewer showing your trained splats
- Evaluation metrics (PSNR, SSIM, LPIPS) for each experiment variation
- Written analysis: what capture choices and training configurations affected quality and why
Part 2: Gaussian Splatting in the ROS Pipeline
In Part 1 you captured real scenes with a physical camera. Now you will integrate Gaussian Splatting into the TurtleBot3 simulation pipeline, where RGB-D images and poses come from Gazebo via ROS 2 and Zenoh. The resulting splat will be a faithful 3D reconstruction of the Gazebo house world — it will look as realistic (or as synthetic) as the Gazebo renderer itself. The value is not photorealism but the representation: a dense, renderable, queryable 3D map that an occupancy grid or sparse point cloud cannot provide. Gazebo’s deterministic renderer and ground-truth odometry give you ideal conditions for 3DGS (perfect multi-view consistency, zero pose drift), letting you focus on the capture pipeline and exploration strategy rather than sensor noise.

Architecture Overview
The capture pipeline follows the same Zenoh subscriber pattern as the existing object_detector.py.
Task 2.1: Implement gs_capture.py
Write a Zenoh capture script (detector/gs_capture.py) that subscribes to the robot’s RGB, depth, and odometry streams and writes keyframes in Nerfstudio format.
Data sources (Zenoh keys):
| Zenoh Key | ROS Topic | Data | Frame |
|---|---|---|---|
| `camera/color/image_raw` | `/camera/color/image_raw` | RGB image (sensor_msgs/Image, CDR) | camera_optical_frame |
| `camera/depth/image_rect_raw` | `/camera/depth/image_rect_raw` | Depth image (float32 meters, CDR) | camera_depth_frame |
| `odom` | `/odom` | Robot odometry (nav_msgs/Odometry, CDR) | odom → base_link |
- Frame synchronization: Capture is RGB-triggered. When an RGB frame arrives, pair it with the most recent depth and odom. Drop the frame if depth or odom is more than 200 ms stale.
- Keyframe gating: Same logic as object_detector.py — only capture when the robot has moved more than a distance threshold or rotated more than an angle threshold since the last keyframe. Expose these as CLI arguments (defaults: 0.3 m, 10 degrees).
- Coordinate transform: This is the most critical part. Nerfstudio’s transform_matrix is camera-to-world in OpenGL convention (x-right, y-up, z-backward). ROS uses x-forward, y-left, z-up for the robot body. The transform chain is T_c2w = T_odom_baselink · T_baselink_camera · T_ros_optical_to_nerfstudio, where:
  - T_odom_baselink — from the /odom message (position + quaternion → 4x4 matrix)
  - T_baselink_camera — static offset from URDF: translation [0.064, -0.065, 0.094], no rotation
  - T_ros_optical_to_nerfstudio — 180-degree rotation around x-axis (flips y and z)
- Depth conversion: Gazebo produces float32 meters. Nerfstudio expects 16-bit PNG in millimeters — scale by 1000 and cast to uint16 before writing.
- Missing odom handling: If no odom has been received, capture pauses. Frames without valid poses are never written. Log a warning if odom is missing for more than 10 seconds.
- Output format: Write transforms.json with camera intrinsics (from the simulated D435i: 320x240, fx=fy=277.13, cx=160, cy=120) and per-frame entries with file_path, depth_file_path, and transform_matrix.
Refer to object_detector.py for the Zenoh subscription pattern, CDR deserialization, and keyframe gating logic. Your capture script follows the same structure but writes a different output format.
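The coordinate transform, depth conversion, and keyframe gating described above can be sketched as follows. This is a minimal sketch assuming numpy; the function names are illustrative, not a required API, and the constants come from the URDF offset and D435i table above:

```python
import numpy as np

def quat_to_matrix(qx, qy, qz, qw):
    """Rotation matrix from a unit quaternion (ROS xyzw order)."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])

def pose_to_transform(x, y, z, qx, qy, qz, qw):
    """4x4 homogeneous transform from an /odom position + quaternion."""
    T = np.eye(4)
    T[:3, :3] = quat_to_matrix(qx, qy, qz, qw)
    T[:3, 3] = [x, y, z]
    return T

# Static base_link -> camera offset from the URDF (translation only).
T_BASELINK_CAMERA = np.eye(4)
T_BASELINK_CAMERA[:3, 3] = [0.064, -0.065, 0.094]

# 180-degree rotation about x: flips y and z (ROS optical -> Nerfstudio/OpenGL).
T_OPTICAL_FLIP = np.diag([1.0, -1.0, -1.0, 1.0])

def nerfstudio_transform(odom_pose):
    """Camera-to-world transform_matrix for transforms.json.

    odom_pose: (x, y, z, qx, qy, qz, qw) from the /odom message.
    """
    T_odom_baselink = pose_to_transform(*odom_pose)
    return T_odom_baselink @ T_BASELINK_CAMERA @ T_OPTICAL_FLIP

def depth_to_uint16_mm(depth_m):
    """float32 meters -> uint16 millimeters; clip handles inf (no return).
    Write the result as PNG, e.g. with cv2.imwrite."""
    return np.clip(np.asarray(depth_m) * 1000.0, 0, 65535).astype(np.uint16)

def keyframe_due(pos, yaw, last_pos, last_yaw,
                 dist_thresh=0.3, angle_thresh_deg=10.0):
    """Keyframe gating: trigger on distance moved or heading change."""
    moved = np.linalg.norm(np.asarray(pos) - np.asarray(last_pos)) > dist_thresh
    dyaw = abs((yaw - last_yaw + np.pi) % (2 * np.pi) - np.pi)  # wrap to [-pi, pi]
    return moved or dyaw > np.radians(angle_thresh_deg)
```

The identity-pose case (robot at the odom origin, facing forward) makes a convenient first unit test: the resulting matrix should carry exactly the URDF translation and the y/z flip.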
Task 2.2: Capture and Train from Simulation
Use the capture script to collect data from the Gazebo house world: launch the simulation, run gs_capture.py, drive the robot through the house (teleop or Nav2), then process and train splatfacto on the captured output as in Part 1.

Task 2.3: Exploration Strategy Experiments
The quality of a Gaussian Splat depends heavily on how well the scene was covered during capture. Run at least two of these experiments:

| Experiment | Description |
|---|---|
| Teleop vs Nav2 waypoints | Compare manual exploration with pre-planned waypoint sequences. Which gives better coverage? |
| Dense vs sparse capture | Compare --keyframe-dist 0.1 --keyframe-angle 5 with --keyframe-dist 1.0 --keyframe-angle 30. How does frame count affect quality and training time? |
| Partial vs full coverage | Train a splat from a single room capture vs. the full house. Where do reconstruction holes appear? |
| Depth supervision | Remove depth_file_path entries from transforms.json and retrain. Compare geometric accuracy and floater artifacts. |
Task 2.4: Sim-to-Real Comparison
Compare your simulation splat (Part 2) with your real-world splat (Part 1). This comparison is designed to make the sim-real gap concrete and measurable.- Render 5 novel views from random poses in each trained splat.
- Qualitatively compare: texture detail, geometric accuracy, lighting, artifacts.
- Discuss:
- What are the fundamental differences between simulated and real-world captures?
- Where does the Gazebo splat faithfully reproduce geometry but fail on visual realism?
- What would need to change in the Gazebo world (materials, lighting, sensor noise) to narrow the gap?
- How does the sim-real gap affect downstream tasks like semantic queries (Part 3)?
Deliverables for Part 2
- Source code for gs_capture.py with inline comments explaining the coordinate transform
- A unit test that verifies the coordinate transform: given a known ROS pose, assert the expected Nerfstudio matrix
- Evaluation metrics and screenshots for each exploration experiment
- Written sim-to-real comparison analysis
Part 3: Semantic Navigation Agent with Language-Embedded Splats
From Closed Vocabulary to Open-World Semantics
In previous assignments, the robot’s object detection pipeline used COCO-trained models (YOLO, Faster R-CNN) that recognize a fixed set of 80 object classes. If “refrigerator” happens to be one of those 80 classes, the detector finds it. If you ask for “the red mug on the counter” or “the fire extinguisher” — objects outside the training vocabulary — the detector is blind.

Language-embedded Gaussian Splatting changes this fundamentally. By embedding CLIP features into each Gaussian, the 3D map becomes queryable with arbitrary natural language — not just a fixed class list. CLIP was trained on 400 million image-text pairs from the internet, so it encodes a broad understanding of visual concepts. A language-embedded splat can answer queries like “wooden bookshelf,” “potted plant near the window,” or “the area that looks like a kitchen” — none of which would match a COCO class label. This is the shift from closed-vocabulary detection (what the model was trained on) to open-vocabulary scene understanding (what language can describe). Combined with an LLM agent, it enables navigation commands in natural language rather than explicit coordinates.

Background: Language-Embedded Gaussian Splatting
Several recent works embed CLIP or DINO features into each Gaussian, creating a 3D scene representation that is both visually renderable and semantically queryable:

| Method | Key Idea | Speed | Reference |
|---|---|---|---|
| LEGS (Berkeley) | Built on Nerfstudio Splatfacto; runs on mobile robot; 66% open-vocab accuracy | 3.5x faster training than LERF | IROS 2024 |
| LangSplat | Scene-wise language autoencoder; CLIP features in latent space per Gaussian | 199x faster than LERF | CVPR 2024 |
| LangSplatV2 | Global 3D codebook for high-dim features | 450+ FPS rendering | NeurIPS 2025 |
| LEGaussians | Quantized CLIP + DINO features as discrete indices per Gaussian | Real-time rendering | CVPR 2024 |
| Feature 3DGS | Learns arbitrary feature fields (CLIP, SAM, etc.) alongside color | Real-time rendering | CVPR 2024 |
Task 3.1: Train a Language-Embedded Splat
Using your house world capture from Part 2, train a language-embedded Gaussian Splat.

- Study the LEGS pipeline: Read the LEGS paper and its Nerfstudio integration. LEGS extends Splatfacto by adding a per-Gaussian language feature vector trained alongside the visual features. At query time, you render the language features from any viewpoint and compare with a text query’s CLIP embedding.
- Generate CLIP annotations: Before training, you need to generate per-image CLIP feature maps from your captured RGB images. LEGS provides scripts for this. The annotations are used as supervision during training alongside the photometric loss.
- Train the language-embedded splat following the LEGS instructions, pointing it at your processed Part 2 capture.
- Test open-vocabulary queries: Once trained, query the splat with natural language. Start with queries that would work with a COCO detector, then go beyond:

COCO-equivalent queries (baseline):
- “refrigerator”
- “couch”
- “dining table”

Open-vocabulary queries (beyond COCO):
- “kitchen counter”
- “the doorway to the bedroom”
- “something to sit on”
- “area with appliances”
Task 3.2: Build the Semantic Query Interface
Build a Python module that takes a trained language-embedded splat and a text query and returns:

- Object location — the 3D position (x, y, z) of the queried object in the world frame
- Relevancy map — a rendered image showing where the query matches in the scene
- Confidence score — cosine similarity between the query embedding and the best-matching region
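The core of this module is a per-pixel cosine-similarity computation. A sketch, assuming you can obtain a rendered per-pixel language feature map from the splat and a CLIP text embedding for the query (both are treated as given inputs here; the function names are illustrative, not a LEGS API):

```python
import numpy as np

def relevancy_map(feature_map, text_embedding):
    """Cosine similarity between each pixel's language feature and the query.

    feature_map: (H, W, D) per-pixel language features rendered from the splat
    text_embedding: (D,) CLIP embedding of the text query
    returns: (H, W) relevancy in [-1, 1]
    """
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return f @ t

def best_match(feature_map, text_embedding):
    """Pixel location (u, v) and confidence of the strongest match."""
    rel = relevancy_map(feature_map, text_embedding)
    v, u = np.unravel_index(np.argmax(rel), rel.shape)
    return (int(u), int(v)), float(rel[v, u])
```

To recover the 3D object location, back-project the best-matching pixel through the camera intrinsics using the rendered depth at that pixel, then transform into the world frame with the rendering viewpoint's pose.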
Task 3.3: LLM-Powered Navigation Agent
Build an agent that accepts natural language navigation commands and translates them into robot trajectories using the semantic splat as a world model. Architecture:

| Query Type | Example | Expected Behavior |
|---|---|---|
| Navigate to object | “go to the refrigerator” | Resolve object location via semantic splat, plan path via Nav2, execute |
| Show trajectories | “show me all the top trajectories that reach the refrigerator location” | Resolve location, compute multiple candidate paths, render expected views along each path from the splat, return ranked results |
| Scene query | “what objects are near the dining table?” | Query multiple object types, compute spatial relationships, return natural language description |
| Exploration | “explore the area around the kitchen” | Resolve “kitchen” region, generate waypoints for coverage, optionally extend the splat with new captures |
- LLM tool interface: Define the agent’s tools as functions the LLM can call:
  - query_object_location(object_name: str) → {position, confidence}
  - get_robot_pose() → {x, y, z, yaw}
  - plan_path(start, goal) → list[Pose]
  - render_view_from_splat(pose) → Image
  - send_nav2_goal(x, y, yaw) → status
- Trajectory ranking: For the “show me top trajectories” query, consider ranking by:
- Path length (shorter is better)
- Clearance from obstacles (safer paths)
- Visual coverage (paths that pass through well-reconstructed areas of the splat, where the robot can verify its position)
- Semantic relevance (paths that pass by related objects)
- Splat-based visualization: For each candidate trajectory, render what the robot would see at key waypoints along the path using the trained splat. This gives the user a preview of the journey before committing to execution.
- ROS integration: The agent communicates with the robot through the existing ROS 2 / Zenoh infrastructure:
  - Read current pose from /odom
  - Send goals via Nav2’s NavigateToPose action
  - Monitor progress via Nav2 feedback
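One way to wire up the tool interface above: declare JSON-schema-style tool descriptions the LLM can choose from, and dispatch its calls to plain Python functions. The schema shape below follows the common OpenAI-style tool-calling convention and is illustrative; adapt it to whichever LLM API you use:

```python
# Two of the five tools, declared as JSON-schema-style descriptions for the LLM.
TOOL_SCHEMAS = [
    {
        "name": "query_object_location",
        "description": "Resolve a natural-language object name to a 3D "
                       "position using the semantic splat.",
        "parameters": {
            "type": "object",
            "properties": {"object_name": {"type": "string"}},
            "required": ["object_name"],
        },
    },
    {
        "name": "send_nav2_goal",
        "description": "Send a NavigateToPose goal to Nav2.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "number"},
                "y": {"type": "number"},
                "yaw": {"type": "number"},
            },
            "required": ["x", "y", "yaw"],
        },
    },
]

def dispatch(tools, name, **kwargs):
    """Route an LLM tool call to the matching Python function."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](**kwargs)
```

The agent loop then alternates between sending the schemas plus conversation state to the LLM and executing whatever tool call it returns via `dispatch`.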
Relevant prior systems to study for this task:

- Splat-Loc (Stanford) — render-and-compare re-localization at ~25 Hz against a pre-built GS map. Relevant for verifying the robot’s position against the splat during trajectory execution.
- GS-Loc (RAL 2025) — vision foundation model-driven re-localization using Gaussian Splats.
- HAMMER (arxiv.org/abs/2501.14147) — multi-robot collaborative semantic GS with ROS communication. Shows how to architect the ROS-to-splat interface.
- ROSplat (github.com/shadygm/ROSplat) — ROS 2 Jazzy package for online GS visualization with custom GaussianArray messages.
Task 3.4: Evaluation
Evaluate your semantic navigation agent:

- Object localization accuracy: For 5 objects in the house world, compare the semantic splat’s estimated position with the ground-truth position from the Gazebo world file. Report the position error in meters.
- Open-vocabulary vs. closed-vocabulary: For 10 queries (5 within COCO vocabulary, 5 outside it), compare:
- Does the COCO detector find the object? (Yes/No)
- Does the semantic splat find the object? (Yes/No, with confidence score)
- Position accuracy for objects both systems can detect
- Query success rate: Issue 10 natural language navigation queries of varying complexity. For each query, record:
- Did the agent correctly parse the intent?
- Did the semantic splat return a valid location?
- Did the planned trajectory successfully reach the goal?
- Were the rendered splat views useful for previewing the trajectory?
- Trajectory quality: For the “show me top trajectories” query, compare the top-3 trajectories by path length, estimated clearance, and visual quality of the rendered waypoint views.
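The trajectory comparison can be made concrete with a simple weighted score that normalizes each criterion before combining. The weights and field names below are illustrative defaults, not a prescribed metric:

```python
def rank_trajectories(candidates, w_length=0.4, w_clearance=0.4, w_coverage=0.2):
    """Rank candidate trajectories by a weighted score in [0, 1].

    candidates: list of dicts with
      'length'    — path length in meters (shorter is better)
      'clearance' — minimum obstacle clearance in meters (larger is better)
      'coverage'  — fraction of waypoints in well-reconstructed splat regions
    """
    max_len = max(c["length"] for c in candidates)
    max_clr = max(c["clearance"] for c in candidates)

    def score(c):
        s_len = 1.0 - c["length"] / max_len            # normalize, invert
        s_clr = c["clearance"] / max_clr if max_clr else 0.0
        return w_length * s_len + w_clearance * s_clr + w_coverage * c["coverage"]

    return sorted(candidates, key=score, reverse=True)
```

Semantic relevance could be added as a fourth term once the query interface from Task 3.2 can score waypoints against the query text.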
Deliverables for Part 3
- Source code for the semantic query interface (gs_semantic_query.py)
- Source code for the LLM navigation agent with tool definitions
- Demo video showing at least 3 different semantic navigation queries being executed
- Evaluation results: object localization accuracy, open-vocab vs closed-vocab comparison, query success rate, trajectory comparison
- Written analysis: strengths and limitations of using a language-embedded splat as a world model for navigation
References
Foundational
- 3D Gaussian Splatting for Real-Time Radiance Field Rendering — SIGGRAPH 2023
- 3DGS in Robotics: A Survey — comprehensive survey of 3DGS applications in robotics
- Awesome 3DGS SLAM — curated list of GS-SLAM papers
GS-SLAM Systems
- RTG-SLAM — real-time RGB-D Gaussian SLAM (SIGGRAPH 2024)
- SplaTAM — best reconstruction quality for RGB-D (CVPR 2024)
- Photo-SLAM — runs on Jetson AGX Orin (CVPR 2024)
Semantic / Language-Embedded Splats
- LEGS — language-embedded Gaussian Splatting for mobile robots (IROS 2024)
- LangSplat — CLIP features per Gaussian (CVPR 2024 Highlight)
- Feature 3DGS — arbitrary feature fields alongside color (CVPR 2024)
- HAMMER — multi-robot collaborative semantic GS with ROS (RAL 2025)
Re-localization
- Splat-Loc — render-and-compare re-localization at ~25 Hz
- GS-Loc — vision foundation model-driven (RAL 2025)
Frameworks and Tools
- Nerfstudio — full training pipeline, Splatfacto method
- gsplat — optimized PyTorch library (4x less memory)
- COLMAP — Structure-from-Motion for camera pose estimation from unposed images (install guide, tutorial)
- ROSplat — ROS 2 Jazzy GS visualization
- Record3D — iPhone LiDAR capture app
- Polycam — alternative mobile capture
- Scaniverse — cross-platform mobile capture (iPhone and Android)
Summary of Deliverables
| Part | Deliverables |
|---|---|
| Part 1 | Trained splat of your room, evaluation metrics, experiment comparisons, written analysis |
| Part 2 | gs_capture.py source + unit test, sim splat with exploration experiments, sim-to-real comparison |
| Part 3 | Semantic query module, LLM navigation agent, demo video, evaluation results, limitations analysis |
Note: inside the Docker container, training outputs are written under /data/ns_outputs (mounted from ./data/ns_outputs on the host). Always use --output-dir /data/ns_outputs to keep artifacts in the mounted volume.

