3D Gaussian Splatting for Robot Navigation
Introduction
Traditional robotics mapping produces sparse point clouds (ORB-SLAM) or 2D occupancy grids (SLAM Toolbox). Neither captures visual appearance. A robot with an occupancy grid of a kitchen knows where the walls are but cannot answer “what does the kitchen look like from the doorway?” or “is this the same kitchen I saw before?”

3D Gaussian Splatting (3DGS) produces a dense, renderable 3D representation by fitting a collection of 3D Gaussians to a set of posed images. Each Gaussian has a position, covariance (shape), color (via spherical harmonics), and opacity. Rendering is done by splatting these Gaussians onto an image plane — no ray marching needed — enabling real-time (100+ FPS) novel view synthesis. For robotics, this enables:

- Visual re-localization — render what the robot expects to see from a candidate pose, compare with what it actually sees
- Scene understanding — with language-embedded splats, the 3D map becomes queryable: “where is the couch?” returns a rendered view and a pose
- Navigation planning — dense 3D geometry from splats generates richer obstacle representations than sparse point clouds
| Part | Focus | Environment |
|---|---|---|
| Part 1 | Capture and train your first Gaussian Splat | Your own room (any camera) |
| Part 2 | Integrate GS capture into the ROS pipeline | Gazebo simulation (TurtleBot + D435i) |
| Part 3 | Build a semantic navigation agent on top of a language-embedded splat | Simulation + LLM |
Prerequisites
- Completed Camera Calibration assignment
- A camera for scene capture — any of the following works:
  - iPhone 12 Pro or newer (LiDAR) with Record3D — best quality, no COLMAP needed
  - Any iPhone/Android with Polycam or Scaniverse
  - Laptop webcam or USB camera — use COLMAP for pose estimation (see below)
- Workstation with NVIDIA GPU (RTX 3060+ for preview quality, RTX 3090/4090 for full quality)
- Docker with NVIDIA Container Toolkit installed
COLMAP Installation
COLMAP is a Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline that estimates camera poses from unposed images. It is required if you are using a monocular camera (laptop webcam, USB camera, or phone without LiDAR).

Option A — Inside the Nerfstudio Docker container (recommended): COLMAP is pre-installed in the Nerfstudio Docker image. No additional setup needed. Run `ns-process-data images` inside the container and it will call COLMAP automatically.
Option B — Native installation on Ubuntu: COLMAP is available from the Ubuntu package repositories (`sudo apt install colmap`); a CUDA-enabled build from source gives faster feature matching.

Either way, the `ns-process-data images` command wraps COLMAP and handles the full SfM pipeline for you — you do not need to call COLMAP directly.
Background Reading
Before starting, study these resources in order:

- Conceptual overview: Introduction to 3D Gaussian Splatting (Hugging Face blog) — accessible explanation of the theory
- Original paper: 3D Gaussian Splatting for Real-Time Radiance Field Rendering (SIGGRAPH 2023) — the foundational work
- Robotics survey: 3DGS in Robotics: A Survey — how 3DGS is being used for SLAM, navigation, and manipulation
- Nerfstudio documentation: Splatfacto method — the training framework you will use
Compute Requirements
| Task | VRAM | GPU | Time |
|---|---|---|---|
| Training splatfacto, 7K iterations (preview) | 12-16 GB | RTX 3060+ | ~10 min |
| Training splatfacto, 30K iterations (full) | 24 GB | RTX 3090/4090 | 30-40 min |
| Training with gsplat (optimized) | 6 GB (4x less) | RTX 3090/4090 | 15% faster |
| Rendering trained splat | — | Any CUDA GPU | 100-200+ FPS |
| COLMAP SfM (200 images) | 2-4 GB | CPU or CUDA GPU | 5-30 min |
Part 1: Capture and Train Your First Gaussian Splat
In this part you will capture a real-world scene (your room, living space, or lab) using any available camera and train a Gaussian Splat from the captured data. The goal is to build intuition for how capture quality affects reconstruction quality before moving to the robotic pipeline.

Task 1.1: Scene Capture
Choose a room-scale scene (3-6 meters across). Good scenes have varied geometry and texture — avoid blank white walls. Follow these capture best practices regardless of camera type:

- Walk slowly — fast motion causes blur and tracking loss
- Maintain 70-80% overlap between consecutive frames
- Capture from multiple heights (standing, crouching) to cover vertical surfaces
- Cover the full scene — gaps in coverage become holes in reconstruction
- Aim for 200-500 frames for a single room
Path A: iPhone with LiDAR (best quality)
If you have an iPhone 12 Pro or newer:

- Install Record3D from the App Store.
- Record a walkthrough of the scene.
- Export as “EXR + JPG sequence”.
- Also export “Zipped PLY point clouds” — these provide a LiDAR-based point cloud that dramatically improves Gaussian initialization.
- Transfer both exports to your workstation.
Path B: Any phone with Polycam/Scaniverse
If you have any iPhone or Android phone:

- Install Polycam or Scaniverse.
- Use the app’s photo mode to capture the scene (follow the overlap guidelines above).
- Export the capture and transfer to your workstation.
Path C: Laptop webcam or USB camera (COLMAP)
If you only have a laptop webcam or USB camera:

- Capture images: Record a video of your room and extract frames from it, or take individual photos while walking around the scene. Alternatively, write a short Python script using OpenCV to capture frames from the webcam at regular intervals as you walk around the room.
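A minimal frame-extraction sketch along those lines, assuming OpenCV (`opencv-python`) is installed. The target rate, file names, and helper names are illustrative:

```python
def frame_indices(total_frames, video_fps, target_fps):
    """Indices of frames to keep when downsampling video_fps to target_fps."""
    step = max(1, round(video_fps / target_fps))
    return list(range(0, total_frames, step))

def extract_frames(video_path, out_dir, target_fps=2.0):
    """Extract frames from a recorded video at roughly target_fps.

    Aim for 200-500 frames total for a single room (see the capture
    guidelines above): e.g., a 3-minute walkthrough at 2 FPS gives ~360.
    """
    import os
    import cv2  # imported here so frame_indices works without OpenCV installed

    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    keep = set(frame_indices(int(cap.get(cv2.CAP_PROP_FRAME_COUNT)), fps, target_fps))
    i = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i in keep:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        i += 1
    cap.release()
    return saved
```

For example, a 30 FPS video downsampled to 3 FPS keeps every 10th frame.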
- Use your calibration: If you completed the Camera Calibration assignment, you can provide your camera intrinsics to improve COLMAP’s accuracy. Otherwise, COLMAP will estimate intrinsics automatically (less accurate, especially for wide-angle laptop cameras with significant distortion).
- Process with COLMAP (inside the Nerfstudio container — COLMAP is pre-installed): run `ns-process-data images --data <image-dir> --output-dir <processed-dir>`. This runs the full COLMAP SfM pipeline: feature extraction (SIFT), feature matching, sparse reconstruction, and undistortion. Expect 5-30 minutes depending on image count and whether GPU-accelerated matching is available.
- Troubleshooting COLMAP failures: If COLMAP fails to reconstruct (common causes: insufficient overlap, textureless surfaces, motion blur):
  - Check that at least 70% of image pairs have matched features — `ns-process-data` will report the number of registered images
  - Re-capture with slower movement and more overlap
  - Try `--feature-type superpoint --matcher-type superglue` for more robust matching (requires the SuperPoint/SuperGlue models)
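For reference, the SfM stages that `ns-process-data` drives can also be invoked as explicit COLMAP CLI calls. The sketch below just builds those command lines (flag names per COLMAP's command-line interface); you normally do not need this, since `ns-process-data` handles it:

```python
def colmap_commands(image_dir, workspace):
    """Build the three core COLMAP SfM stages as subprocess command lists:
    feature extraction -> exhaustive matching -> sparse mapping."""
    db = f"{workspace}/database.db"
    return [
        ["colmap", "feature_extractor", "--database_path", db,
         "--image_path", image_dir],
        ["colmap", "exhaustive_matcher", "--database_path", db],
        ["colmap", "mapper", "--database_path", db,
         "--image_path", image_dir, "--output_path", f"{workspace}/sparse"],
    ]

def run_colmap(image_dir, workspace):
    """Run the stages in order; raises if any SfM stage fails."""
    import subprocess
    for cmd in colmap_commands(image_dir, workspace):
        subprocess.run(cmd, check=True)
```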
| Path | Poses from | Depth available | Point cloud init | COLMAP needed | Quality |
|---|---|---|---|---|---|
| A: Record3D (LiDAR) | ARKit | Yes (LiDAR) | Yes (PLY export) | No | Best |
| B: Polycam/Scaniverse | App SfM or ARKit | Sometimes | Sometimes | Sometimes | Good |
| C: Webcam + COLMAP | COLMAP SfM | No | SfM sparse points | Yes | Adequate |
Task 1.2: Nerfstudio Setup
You need a working Nerfstudio environment with GPU support. Choose one of the following options:

Option A: Official Nerfstudio Docker image (recommended)
The Nerfstudio project publishes an official Docker image on GitHub Container Registry that includes Nerfstudio, COLMAP, PyTorch, and CUDA pre-configured. Pull a pinned version tag (e.g., `ghcr.io/nerfstudio-project/nerfstudio:1.1.3`). See the GHCR package page for all available tags.
If you prefer to build the image yourself (e.g., to target a specific CUDA architecture), build from the Dockerfile in the Nerfstudio repository.
Option B: Native pip install
If you prefer a local installation without Docker, follow the Nerfstudio installation guide: install a matching PyTorch/CUDA build, then `pip install nerfstudio`.

Task 1.3: Train Your First Splat
Inside your Nerfstudio environment (Docker or native), train splatfacto on your processed data with `ns-train splatfacto --data <processed-dir>`.

Task 1.4: Evaluate and Experiment
Once your first splat is trained, run evaluation and begin experimenting:

- Evaluate your trained splat using Nerfstudio’s built-in metrics (`ns-eval --load-config <path-to-config.yml>`). Report PSNR, SSIM, and LPIPS on the held-out test views.
- Export the trained splat for external viewing (e.g., `ns-export gaussian-splat` produces a .ply you can open in standard splat viewers).
- Experiment with at least two of the following variations and compare results:
| Experiment | What to vary | What to measure | Notes |
|---|---|---|---|
| Training duration | 7K vs 15K vs 30K iterations | PSNR/SSIM, training time, visual quality | All paths |
| Method variant | splatfacto vs splatfacto-big | Quality metrics, VRAM usage, training time | All paths |
| Depth supervision | With vs without depth_file_path in transforms.json | Geometric accuracy, floaters | Path A/B only (requires depth) |
| Point cloud init | With vs without --ply during processing | Convergence speed, final quality | Path A only (requires LiDAR PLY) |
| Quality tuning | Default vs tuned (--pipeline.model.cull_alpha_thresh=0.005 --pipeline.model.continue_cull_post_densification=False --pipeline.model.use_scale_regularization=True) | Floaters, edge quality | All paths |
| Capture density | Vary frame extraction rate (e.g., 1 FPS vs 3 FPS vs 5 FPS) | Quality vs training time | Path C (video extraction) |
- Capture a second scene — a different room or area — and train a splat. You will use both scenes to discuss coverage vs. quality tradeoffs in your report.
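PSNR, reported throughout these experiments, is simple to compute directly if you want to sanity-check the numbers `ns-eval` produces. A minimal numpy sketch (SSIM and LPIPS need additional packages, e.g. scikit-image and lpips, and are omitted here):

```python
import numpy as np

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val**2 / mse)
```

For example, two images differing uniformly by 0.1 give an MSE of 0.01 and a PSNR of 20 dB; well-trained room-scale splats typically land in the mid-20s to low-30s.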
Deliverables for Part 1
- Screenshots or screen recordings of the Nerfstudio viewer showing your trained splats
- Evaluation metrics (PSNR, SSIM, LPIPS) for each experiment variation
- Written analysis: what capture choices and training configurations affected quality and why
Part 2: Gaussian Splatting in the ROS Pipeline
In Part 1 you captured real scenes with a physical camera. Now you will integrate Gaussian Splatting into the TurtleBot3 simulation pipeline, where RGB-D images and poses come from Gazebo via ROS 2 and Zenoh. The resulting splat will be a faithful 3D reconstruction of the Gazebo house world — it will look as realistic (or as synthetic) as the Gazebo renderer itself. The value is not photorealism but the representation: a dense, renderable, queryable 3D map that an occupancy grid or sparse point cloud cannot provide. Gazebo’s deterministic renderer and ground-truth odometry give you ideal conditions for 3DGS (perfect multi-view consistency, zero pose drift), letting you focus on the capture pipeline and exploration strategy rather than sensor noise.

Architecture Overview
The capture pipeline follows the same Zenoh subscriber pattern as the existing object_detector.py.
Task 2.1: Implement gs_capture.py
Write a Zenoh capture script (detector/gs_capture.py) that subscribes to the robot’s RGB, depth, and odometry streams and writes keyframes in Nerfstudio format.
Data sources (Zenoh keys):
| Zenoh Key | ROS Topic | Data | Frame |
|---|---|---|---|
| `camera/color/image_raw` | `/camera/color/image_raw` | RGB image (sensor_msgs/Image, CDR) | camera_optical_frame |
| `camera/depth/image_rect_raw` | `/camera/depth/image_rect_raw` | Depth image (float32 meters, CDR) | camera_depth_frame |
| `odom` | `/odom` | Robot odometry (nav_msgs/Odometry, CDR) | odom → base_link |
- Frame synchronization: Capture is RGB-triggered. When an RGB frame arrives, pair it with the most recent depth and odom. Drop the frame if depth or odom is more than 200 ms stale.
- Keyframe gating: Same logic as object_detector.py — only capture when the robot has moved more than a distance threshold or rotated more than an angle threshold since the last keyframe. Expose these as CLI arguments (defaults: 0.3 m, 10 degrees).
- Coordinate transform: This is the most critical part. Nerfstudio’s transform_matrix is camera-to-world in OpenGL convention (x-right, y-up, z-backward). ROS uses x-forward, y-left, z-up for the robot body. The transform chain is T_c2w = T_odom_baselink · T_baselink_camera · T_ros_optical_to_nerfstudio, where:
  - T_odom_baselink — from the /odom message (position + quaternion → 4x4 matrix)
  - T_baselink_camera — static offset from URDF: translation [0.064, -0.065, 0.094], no rotation
  - T_ros_optical_to_nerfstudio — 180-degree rotation around x-axis (flips y and z)
- Depth conversion: Gazebo produces float32 meters. Nerfstudio expects 16-bit PNG in millimeters — scale by 1000 and cast to uint16 before writing.
- Missing odom handling: If no odom has been received, capture pauses. Frames without valid poses are never written. Log a warning if odom is missing for more than 10 seconds.
- Output format: Write transforms.json with camera intrinsics (from the simulated D435i: 320x240, fx=fy=277.13, cx=160, cy=120) and per-frame entries with file_path, depth_file_path, and transform_matrix.
Refer to object_detector.py for the Zenoh subscription pattern, CDR deserialization, and keyframe gating logic. Your capture script follows the same structure but writes a different output format.
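The coordinate transform, depth conversion, and keyframe gating described above can be sketched as follows. This is a minimal sketch assuming numpy; the function names are illustrative, not a required API, and the constants come from the URDF offset and D435i table above:

```python
import numpy as np

def quat_to_matrix(qx, qy, qz, qw):
    """Rotation matrix from a unit quaternion (ROS xyzw order)."""
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qz*qw),     2*(qx*qz + qy*qw)],
        [2*(qx*qy + qz*qw),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qx*qw)],
        [2*(qx*qz - qy*qw),     2*(qy*qz + qx*qw),     1 - 2*(qx*qx + qy*qy)],
    ])

def pose_to_transform(x, y, z, qx, qy, qz, qw):
    """4x4 homogeneous transform from an /odom position + quaternion."""
    T = np.eye(4)
    T[:3, :3] = quat_to_matrix(qx, qy, qz, qw)
    T[:3, 3] = [x, y, z]
    return T

# Static base_link -> camera offset from the URDF (translation only).
T_BASELINK_CAMERA = np.eye(4)
T_BASELINK_CAMERA[:3, 3] = [0.064, -0.065, 0.094]

# 180-degree rotation about x: flips y and z (ROS optical -> Nerfstudio/OpenGL).
T_OPTICAL_FLIP = np.diag([1.0, -1.0, -1.0, 1.0])

def nerfstudio_transform(odom_pose):
    """Camera-to-world transform_matrix for transforms.json.

    odom_pose: (x, y, z, qx, qy, qz, qw) from the /odom message.
    """
    T_odom_baselink = pose_to_transform(*odom_pose)
    return T_odom_baselink @ T_BASELINK_CAMERA @ T_OPTICAL_FLIP

def depth_to_uint16_mm(depth_m):
    """float32 meters -> uint16 millimeters; clip handles inf (no return).
    Write the result as PNG, e.g. with cv2.imwrite."""
    return np.clip(np.asarray(depth_m) * 1000.0, 0, 65535).astype(np.uint16)

def keyframe_due(pos, yaw, last_pos, last_yaw,
                 dist_thresh=0.3, angle_thresh_deg=10.0):
    """Keyframe gating: trigger on distance moved or heading change."""
    moved = np.linalg.norm(np.asarray(pos) - np.asarray(last_pos)) > dist_thresh
    dyaw = abs((yaw - last_yaw + np.pi) % (2 * np.pi) - np.pi)  # wrap to [-pi, pi]
    return moved or dyaw > np.radians(angle_thresh_deg)
```

The identity-pose case (robot at the odom origin, facing forward) makes a convenient first unit test: the resulting matrix should carry exactly the URDF translation and the y/z flip.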
Task 2.2: Capture and Train from Simulation
Use the capture script to collect data from the Gazebo house world: launch the simulation, run gs_capture.py, drive the robot through the house (teleop or Nav2), then process and train splatfacto on the captured output as in Part 1.

Task 2.3: Exploration Strategy Experiments
The quality of a Gaussian Splat depends heavily on how well the scene was covered during capture. Run at least two of these experiments:

| Experiment | Description |
|---|---|
| Teleop vs Nav2 waypoints | Compare manual exploration with pre-planned waypoint sequences. Which gives better coverage? |
| Dense vs sparse capture | Compare --keyframe-dist 0.1 --keyframe-angle 5 with --keyframe-dist 1.0 --keyframe-angle 30. How does frame count affect quality and training time? |
| Partial vs full coverage | Train a splat from a single room capture vs. the full house. Where do reconstruction holes appear? |
| Depth supervision | Remove depth_file_path entries from transforms.json and retrain. Compare geometric accuracy and floater artifacts. |
Task 2.4: Sim-to-Real Comparison
Compare your simulation splat (Part 2) with your real-world splat (Part 1). This comparison is designed to make the sim-real gap concrete and measurable.- Render 5 novel views from random poses in each trained splat.
- Qualitatively compare: texture detail, geometric accuracy, lighting, artifacts.
- Discuss:
- What are the fundamental differences between simulated and real-world captures?
- Where does the Gazebo splat faithfully reproduce geometry but fail on visual realism?
- What would need to change in the Gazebo world (materials, lighting, sensor noise) to narrow the gap?
- How does the sim-real gap affect downstream tasks like semantic queries (Part 3)?
Deliverables for Part 2
- Source code for gs_capture.py with inline comments explaining the coordinate transform
- A unit test that verifies the coordinate transform: given a known ROS pose, assert the expected Nerfstudio matrix
- Evaluation metrics and screenshots for each exploration experiment
- Written sim-to-real comparison analysis
Part 3: Semantic Navigation Agent with Language-Embedded Splats
From Closed Vocabulary to Open-World Semantics
In previous assignments, the robot’s object detection pipeline used COCO-trained models (YOLO, Faster R-CNN) that recognize a fixed set of 80 object classes. If “refrigerator” happens to be one of those 80 classes, the detector finds it. If you ask for “the red mug on the counter” or “the fire extinguisher” — objects outside the training vocabulary — the detector is blind.

Language-embedded Gaussian Splatting changes this fundamentally. By embedding CLIP features into each Gaussian, the 3D map becomes queryable with arbitrary natural language — not just a fixed class list. CLIP was trained on 400 million image-text pairs from the internet, so it encodes a broad understanding of visual concepts. A language-embedded splat can answer queries like “wooden bookshelf,” “potted plant near the window,” or “the area that looks like a kitchen” — none of which would match a COCO class label. This is the shift from closed-vocabulary detection (what the model was trained on) to open-vocabulary scene understanding (what language can describe). Combined with an LLM agent, it enables navigation commands in natural language rather than explicit coordinates.

Background: Language-Embedded Gaussian Splatting
Several recent works embed CLIP or DINO features into each Gaussian, creating a 3D scene representation that is both visually renderable and semantically queryable:

| Method | Key Idea | Speed | Reference |
|---|---|---|---|
| LEGS (Berkeley) | Built on Nerfstudio Splatfacto; runs on mobile robot; 66% open-vocab accuracy | 3.5x faster training than LERF | IROS 2024 |
| LangSplat | Scene-wise language autoencoder; CLIP features in latent space per Gaussian | 199x faster than LERF | CVPR 2024 |
| LangSplatV2 | Global 3D codebook for high-dim features | 450+ FPS rendering | NeurIPS 2025 |
| LEGaussians | Quantized CLIP + DINO features as discrete indices per Gaussian | Real-time rendering | CVPR 2024 |
| Feature 3DGS | Learns arbitrary feature fields (CLIP, SAM, etc.) alongside color | Real-time rendering | CVPR 2024 |
Task 3.1: Train a Language-Embedded Splat
Using your house world capture from Part 2, train a language-embedded Gaussian Splat.

- Study the LEGS pipeline: Read the LEGS paper and its Nerfstudio integration. LEGS extends Splatfacto by adding a per-Gaussian language feature vector trained alongside the visual features. At query time, you render the language features from any viewpoint and compare with a text query’s CLIP embedding.
- Generate CLIP annotations: Before training, you need to generate per-image CLIP feature maps from your captured RGB images. LEGS provides scripts for this. The annotations are used as supervision during training alongside the photometric loss.
- Train the language-embedded splat following the LEGS instructions, pointing it at your processed Part 2 capture.
- Test open-vocabulary queries: Once trained, query the splat with natural language. Start with queries that would work with a COCO detector, then go beyond:

COCO-equivalent queries (baseline):
- “refrigerator”
- “couch”
- “dining table”

Open-vocabulary queries (beyond COCO):
- “kitchen counter”
- “the doorway to the bedroom”
- “something to sit on”
- “area with appliances”
Task 3.2: Build the Semantic Query Interface
Build a Python module that takes a trained language-embedded splat and a text query and returns:

- Object location — the 3D position (x, y, z) of the queried object in the world frame
- Relevancy map — a rendered image showing where the query matches in the scene
- Confidence score — cosine similarity between the query embedding and the best-matching region
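The core of this module is a per-pixel cosine-similarity computation. A sketch, assuming you can obtain a rendered per-pixel language feature map from the splat and a CLIP text embedding for the query (both are treated as given inputs here; the function names are illustrative, not a LEGS API):

```python
import numpy as np

def relevancy_map(feature_map, text_embedding):
    """Cosine similarity between each pixel's language feature and the query.

    feature_map: (H, W, D) per-pixel language features rendered from the splat
    text_embedding: (D,) CLIP embedding of the text query
    returns: (H, W) relevancy in [-1, 1]
    """
    f = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)
    return f @ t

def best_match(feature_map, text_embedding):
    """Pixel location (u, v) and confidence of the strongest match."""
    rel = relevancy_map(feature_map, text_embedding)
    v, u = np.unravel_index(np.argmax(rel), rel.shape)
    return (int(u), int(v)), float(rel[v, u])
```

To recover the 3D object location, back-project the best-matching pixel through the camera intrinsics using the rendered depth at that pixel, then transform into the world frame with the rendering viewpoint's pose.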
Task 3.3: LLM-Powered Navigation Agent
Build an agent that accepts natural language navigation commands and translates them into robot trajectories using the semantic splat as a world model. Architecture:

| Query Type | Example | Expected Behavior |
|---|---|---|
| Navigate to object | “go to the refrigerator” | Resolve object location via semantic splat, plan path via Nav2, execute |
| Show trajectories | “show me all the top trajectories that reach the refrigerator location” | Resolve location, compute multiple candidate paths, render expected views along each path from the splat, return ranked results |
| Scene query | “what objects are near the dining table?” | Query multiple object types, compute spatial relationships, return natural language description |
| Exploration | “explore the area around the kitchen” | Resolve “kitchen” region, generate waypoints for coverage, optionally extend the splat with new captures |
- LLM tool interface: Define the agent’s tools as functions the LLM can call:
  - query_object_location(object_name: str) → {position, confidence}
  - get_robot_pose() → {x, y, z, yaw}
  - plan_path(start, goal) → list[Pose]
  - render_view_from_splat(pose) → Image
  - send_nav2_goal(x, y, yaw) → status
- Trajectory ranking: For the “show me top trajectories” query, consider ranking by:
- Path length (shorter is better)
- Clearance from obstacles (safer paths)
- Visual coverage (paths that pass through well-reconstructed areas of the splat, where the robot can verify its position)
- Semantic relevance (paths that pass by related objects)
- Splat-based visualization: For each candidate trajectory, render what the robot would see at key waypoints along the path using the trained splat. This gives the user a preview of the journey before committing to execution.
- ROS integration: The agent communicates with the robot through the existing ROS 2 / Zenoh infrastructure:
  - Read current pose from /odom
  - Send goals via Nav2’s NavigateToPose action
  - Monitor progress via Nav2 feedback
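One way to wire up the tool interface above: declare JSON-schema-style tool descriptions the LLM can choose from, and dispatch its calls to plain Python functions. The schema shape below follows the common OpenAI-style tool-calling convention and is illustrative; adapt it to whichever LLM API you use:

```python
# Two of the five tools, declared as JSON-schema-style descriptions for the LLM.
TOOL_SCHEMAS = [
    {
        "name": "query_object_location",
        "description": "Resolve a natural-language object name to a 3D "
                       "position using the semantic splat.",
        "parameters": {
            "type": "object",
            "properties": {"object_name": {"type": "string"}},
            "required": ["object_name"],
        },
    },
    {
        "name": "send_nav2_goal",
        "description": "Send a NavigateToPose goal to Nav2.",
        "parameters": {
            "type": "object",
            "properties": {
                "x": {"type": "number"},
                "y": {"type": "number"},
                "yaw": {"type": "number"},
            },
            "required": ["x", "y", "yaw"],
        },
    },
]

def dispatch(tools, name, **kwargs):
    """Route an LLM tool call to the matching Python function."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    return tools[name](**kwargs)
```

The agent loop then alternates between sending the schemas plus conversation state to the LLM and executing whatever tool call it returns via `dispatch`.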
Relevant prior systems to study for this task:

- Splat-Loc (Stanford) — render-and-compare re-localization at ~25 Hz against a pre-built GS map. Relevant for verifying the robot’s position against the splat during trajectory execution.
- GS-Loc (RAL 2025) — vision foundation model-driven re-localization using Gaussian Splats.
- HAMMER (arxiv.org/abs/2501.14147) — multi-robot collaborative semantic GS with ROS communication. Shows how to architect the ROS-to-splat interface.
- ROSplat (github.com/shadygm/ROSplat) — ROS 2 Jazzy package for online GS visualization with custom GaussianArray messages.
Task 3.4: Evaluation
Evaluate your semantic navigation agent:

- Object localization accuracy: For 5 objects in the house world, compare the semantic splat’s estimated position with the ground-truth position from the Gazebo world file. Report the position error in meters.
- Open-vocabulary vs. closed-vocabulary: For 10 queries (5 within COCO vocabulary, 5 outside it), compare:
- Does the COCO detector find the object? (Yes/No)
- Does the semantic splat find the object? (Yes/No, with confidence score)
- Position accuracy for objects both systems can detect
- Query success rate: Issue 10 natural language navigation queries of varying complexity. For each query, record:
- Did the agent correctly parse the intent?
- Did the semantic splat return a valid location?
- Did the planned trajectory successfully reach the goal?
- Were the rendered splat views useful for previewing the trajectory?
- Trajectory quality: For the “show me top trajectories” query, compare the top-3 trajectories by path length, estimated clearance, and visual quality of the rendered waypoint views.
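The trajectory comparison can be made concrete with a simple weighted score that normalizes each criterion before combining. The weights and field names below are illustrative defaults, not a prescribed metric:

```python
def rank_trajectories(candidates, w_length=0.4, w_clearance=0.4, w_coverage=0.2):
    """Rank candidate trajectories by a weighted score in [0, 1].

    candidates: list of dicts with
      'length'    — path length in meters (shorter is better)
      'clearance' — minimum obstacle clearance in meters (larger is better)
      'coverage'  — fraction of waypoints in well-reconstructed splat regions
    """
    max_len = max(c["length"] for c in candidates)
    max_clr = max(c["clearance"] for c in candidates)

    def score(c):
        s_len = 1.0 - c["length"] / max_len            # normalize, invert
        s_clr = c["clearance"] / max_clr if max_clr else 0.0
        return w_length * s_len + w_clearance * s_clr + w_coverage * c["coverage"]

    return sorted(candidates, key=score, reverse=True)
```

Semantic relevance could be added as a fourth term once the query interface from Task 3.2 can score waypoints against the query text.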
Deliverables for Part 3
- Source code for the semantic query interface (gs_semantic_query.py)
- Source code for the LLM navigation agent with tool definitions
- Demo video showing at least 3 different semantic navigation queries being executed
- Evaluation results: object localization accuracy, open-vocab vs closed-vocab comparison, query success rate, trajectory comparison
- Written analysis: strengths and limitations of using a language-embedded splat as a world model for navigation
References
Foundational
- 3D Gaussian Splatting for Real-Time Radiance Field Rendering — SIGGRAPH 2023
- 3DGS in Robotics: A Survey — comprehensive survey of 3DGS applications in robotics
- Awesome 3DGS SLAM — curated list of GS-SLAM papers
GS-SLAM Systems
- RTG-SLAM — real-time RGB-D Gaussian SLAM (SIGGRAPH 2024)
- SplaTAM — best reconstruction quality for RGB-D (CVPR 2024)
- Photo-SLAM — runs on Jetson AGX Orin (CVPR 2024)
Semantic / Language-Embedded Splats
- LEGS — language-embedded Gaussian Splatting for mobile robots (IROS 2024)
- LangSplat — CLIP features per Gaussian (CVPR 2024 Highlight)
- Feature 3DGS — arbitrary feature fields alongside color (CVPR 2024)
- HAMMER — multi-robot collaborative semantic GS with ROS (RAL 2025)
Re-localization
- Splat-Loc — render-and-compare re-localization at ~25 Hz
- GS-Loc — vision foundation model-driven (RAL 2025)
Frameworks and Tools
- Nerfstudio — full training pipeline, Splatfacto method
- gsplat — optimized PyTorch library (4x less memory)
- COLMAP — Structure-from-Motion for camera pose estimation from unposed images (install guide, tutorial)
- ROSplat — ROS 2 Jazzy GS visualization
- Record3D — iPhone LiDAR capture app
- Polycam — alternative mobile capture
- Scaniverse — cross-platform mobile capture (iPhone and Android)
Summary of Deliverables
| Part | Deliverables |
|---|---|
| Part 1 | Trained splat of your room, evaluation metrics, experiment comparisons, written analysis |
| Part 2 | gs_capture.py source + unit test, sim splat with exploration experiments, sim-to-real comparison |
| Part 3 | Semantic query module, LLM navigation agent, demo video, evaluation results, limitations analysis |
Note: inside the Docker container, training outputs are written under /data/ns_outputs (mounted from ./data/ns_outputs on the host). Always use --output-dir /data/ns_outputs to keep artifacts in the mounted volume.

