
3D Gaussian Splatting for Robot Navigation

Introduction

Traditional robotics mapping produces sparse point clouds (ORB-SLAM) or 2D occupancy grids (SLAM Toolbox). Neither captures visual appearance. A robot with an occupancy grid of a kitchen knows where the walls are but cannot answer “what does the kitchen look like from the doorway?” or “is this the same kitchen I saw before?” 3D Gaussian Splatting (3DGS) produces a dense, renderable 3D representation by fitting a collection of 3D Gaussians to a set of posed images. Each Gaussian has a position, covariance (shape), color (via spherical harmonics), and opacity. Rendering is done by splatting these Gaussians onto an image plane — no ray marching needed — enabling real-time (100+ FPS) novel view synthesis. For robotics, this enables:
  • Visual re-localization — render what the robot expects to see from a candidate pose, compare with what it actually sees
  • Scene understanding — with language-embedded splats, the 3D map becomes queryable: “where is the couch?” returns a rendered view and a pose
  • Navigation planning — dense 3D geometry from splats generates richer obstacle representations than sparse point clouds
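To make the representation concrete, here is a minimal sketch of the per-Gaussian parameters described above. The field names and the spherical-harmonics degree are illustrative, not tied to any particular library:

from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """Illustrative parameterization of one 3D Gaussian (names are not from any specific library)."""
    mean: np.ndarray        # (3,)   position in world coordinates
    rotation: np.ndarray    # (4,)   unit quaternion (w, x, y, z) giving the Gaussian's orientation
    scale: np.ndarray       # (3,)   per-axis standard deviations
    opacity: float          # scalar alpha in [0, 1]
    sh_coeffs: np.ndarray   # (16, 3) spherical-harmonic color coefficients (degree 3)

    def covariance(self) -> np.ndarray:
        """Sigma = R S S^T R^T — the anisotropic covariance used when splatting."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T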
An important note on simulation vs. reality. In Part 2 of this project you will train a Gaussian Splat from images captured by a simulated D435i camera in the Gazebo house world. 3DGS faithfully reconstructs whatever it sees — if the input images are synthetic, the output splat will look synthetic too. A Gaussian Splat cannot make Gazebo renders more photorealistic than they already are. What 3DGS does give you, even from synthetic input, is a dense, view-consistent 3D representation that supports novel view synthesis, depth rendering, and semantic queries — capabilities that occupancy grids and sparse point clouds lack.

This works because 3DGS requires multi-view consistency and accurate camera poses, not photorealism. Gazebo satisfies both: its deterministic renderer produces perfectly consistent images across viewpoints, and simulated odometry provides ground-truth poses with zero drift. The original 3DGS paper evaluated on the synthetic Blender/NeRF-Synthetic dataset, and all major GS-SLAM systems (SplaTAM, GS-SLAM, RTG-SLAM) benchmark on Replica — a synthetic indoor dataset rendered with OpenGL.

That said, Gazebo’s default rendering quality (Ogre2) is below Replica’s level of fidelity, so expect the reconstruction to reflect the visual quality of Gazebo’s house world, not a photorealistic scan. The sim-to-real comparison in Task 2.4 is designed to surface exactly this gap. Gazebo’s rendering pipeline does support higher-fidelity materials (PBR with normal/roughness/metalness maps, baked light maps with global illumination, and configurable sensor noise), but the default house world does not fully exploit these capabilities.

This project has three parts:
Part | Focus | Environment
Part 1 | Capture and train your first Gaussian Splat | Your own room (any camera)
Part 2 | Integrate GS capture into the ROS pipeline | Gazebo simulation (TurtleBot + D435i)
Part 3 | Build a semantic navigation agent on top of a language-embedded splat | Simulation + LLM

Prerequisites

  • Completed Camera Calibration assignment
  • A camera for scene capture — any of the following works:
    • iPhone 12 Pro or newer (LiDAR) with Record3D — best quality, no COLMAP needed
    • Any iPhone/Android with Polycam or Scaniverse
    • Laptop webcam or USB camera — use COLMAP for pose estimation (see below)
  • Workstation with NVIDIA GPU (RTX 3060+ for preview quality, RTX 3090/4090 for full quality)
  • Docker with NVIDIA Container Toolkit installed

COLMAP Installation

COLMAP is a Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline that estimates camera poses from unposed images. It is required if you are using a monocular camera (laptop webcam, USB camera, or phone without LiDAR).

Option A — Inside the Nerfstudio Docker container (recommended): COLMAP is pre-installed in the Nerfstudio Docker image. No additional setup needed. Run ns-process-data images inside the container and it will call COLMAP automatically.

Option B — Native installation on Ubuntu:
sudo apt-get install colmap
For GPU-accelerated feature matching (significantly faster on large image sets), build from source with CUDA support. Follow the official build instructions.

Option C — pip (CPU-only, slower but simplest):
pip install pycolmap
Nerfstudio’s ns-process-data images command wraps COLMAP and handles the full SfM pipeline for you — you do not need to call COLMAP directly.

Background Reading

Before starting, study these resources in order:
  1. Conceptual overview: Introduction to 3D Gaussian Splatting (Hugging Face blog) — accessible explanation of the theory
  2. Original paper: 3D Gaussian Splatting for Real-Time Radiance Field Rendering (SIGGRAPH 2023) — the foundational work
  3. Robotics survey: 3DGS in Robotics: A Survey — how 3DGS is being used for SLAM, navigation, and manipulation
  4. Nerfstudio documentation: Splatfacto method — the training framework you will use

Compute Requirements

Task | VRAM | GPU | Time / Speed
Training splatfacto, 7K iterations (preview) | 12-16 GB | RTX 3060+ | ~10 min
Training splatfacto, 30K iterations (full) | 24 GB | RTX 3090/4090 | 30-40 min
Training with gsplat (optimized) | 6 GB (4x less) | RTX 3090/4090 | 15% faster
Rendering trained splat | — | Any CUDA GPU | 100-200+ FPS
COLMAP SfM (200 images) | 2-4 GB | CPU or CUDA GPU | 5-30 min

Part 1: Capture and Train Your First Gaussian Splat

In this part you will capture a real-world scene (your room, living space, or lab) using any available camera and train a Gaussian Splat from the captured data. The goal is to build intuition for how capture quality affects reconstruction quality before moving to the robotic pipeline.

Task 1.1: Scene Capture

Choose a room-scale scene (3-6 meters across). Good scenes have varied geometry and texture — avoid blank white walls. Follow these capture best practices regardless of camera type:
  • Walk slowly — fast motion causes blur and tracking loss
  • Maintain 70-80% overlap between consecutive frames
  • Capture from multiple heights (standing, crouching) to cover vertical surfaces
  • Cover the full scene — gaps in coverage become holes in reconstruction
  • Aim for 200-500 frames for a single room

Path A: iPhone with LiDAR (best quality)

If you have an iPhone 12 Pro or newer:
  1. Install Record3D from the App Store.
  2. Record a walkthrough of the scene.
  3. Export as “EXR + JPG sequence”.
  4. Also export “Zipped PLY point clouds” — these provide a LiDAR-based point cloud that dramatically improves Gaussian initialization.
  5. Transfer both exports to your workstation.
Processing (inside the Nerfstudio container):
ns-process-data record3d \
  --data /data/captures/my_room_r3d \
  --ply /data/captures/my_room_ply \
  --output-dir /data/captures/my_room_processed \
  --voxel-size 0.5
No COLMAP needed — poses come from ARKit.

Path B: Any phone with Polycam/Scaniverse

If you have any iPhone or Android phone:
  1. Install Polycam or Scaniverse.
  2. Use the app’s photo mode to capture the scene (follow the overlap guidelines above).
  3. Export the capture and transfer to your workstation.
Processing:
# Polycam export
ns-process-data polycam --data /data/captures/polycam_export --output-dir /data/captures/my_room_processed

# Scaniverse / generic images
ns-process-data images --data /data/captures/my_room_images --output-dir /data/captures/my_room_processed

Path C: Laptop webcam or USB camera (COLMAP)

If you only have a laptop webcam or USB camera:
  1. Capture images: Record a video of your room, then extract frames, or take individual photos while walking around the scene. For video extraction:
    # Extract 1 frame per second from video (adjust -r for density)
    ffmpeg -i my_room_video.mp4 -r 1 -q:v 2 /data/captures/my_room_images/frame_%04d.jpg
    
    Alternatively, write a short Python script using OpenCV to capture frames from the webcam at regular intervals as you walk around the room (a minimal sketch is given after this list).
  2. Use your calibration: If you completed the Camera Calibration assignment, you can provide your camera intrinsics to improve COLMAP’s accuracy. Otherwise, COLMAP will estimate intrinsics automatically (less accurate, especially for wide-angle laptop cameras with significant distortion).
  3. Process with COLMAP (inside the Nerfstudio container — COLMAP is pre-installed):
    ns-process-data images \
      --data /data/captures/my_room_images \
      --output-dir /data/captures/my_room_processed
    
    This runs the full COLMAP SfM pipeline: feature extraction (SIFT), feature matching, sparse reconstruction, and undistortion. Expect 5-30 minutes depending on image count and whether GPU-accelerated matching is available.
  4. Troubleshooting COLMAP failures: If COLMAP fails to reconstruct (common causes: insufficient overlap, textureless surfaces, motion blur):
    • Check that at least 70% of image pairs have matched features — ns-process-data will report the number of registered images
    • Re-capture with slower movement and more overlap
    • Try --feature-type superpoint --matcher-type superglue for more robust matching (requires the SuperPoint/SuperGlue models)
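As referenced in step 1 above, a minimal sketch of an OpenCV script that grabs webcam frames at a fixed interval. The device index, interval, and output path are arbitrary choices — adjust them for your setup:

# capture_frames.py — grab webcam frames at a fixed interval while you walk the scene.
import time
from pathlib import Path

import cv2

out_dir = Path("/data/captures/my_room_images")
out_dir.mkdir(parents=True, exist_ok=True)

cap = cv2.VideoCapture(0)          # default webcam
interval_s = 1.0                   # one frame per second, similar to the ffmpeg example above
frame_idx = 0
last_save = 0.0

try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        now = time.time()
        if now - last_save >= interval_s:
            cv2.imwrite(str(out_dir / f"frame_{frame_idx:04d}.jpg"), frame)
            frame_idx += 1
            last_save = now
        # Press 'q' in the preview window to stop capturing.
        cv2.imshow("capture", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
finally:
    cap.release()
    cv2.destroyAllWindows()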
Comparison of capture paths:
Path | Poses from | Depth available | Point cloud init | COLMAP needed | Quality
A: Record3D (LiDAR) | ARKit | Yes (LiDAR) | Yes (PLY export) | No | Best
B: Polycam/Scaniverse | App SfM or ARKit | Sometimes | Sometimes | Sometimes | Good
C: Webcam + COLMAP | COLMAP SfM | No | SfM sparse points | Yes | Adequate
Path C produces RGB-only splats without depth supervision. This is perfectly valid — the original 3DGS paper used RGB-only with COLMAP poses. You will lose depth supervision (one of the experiment variables in Task 1.3), and initialization will use COLMAP’s sparse point cloud instead of dense LiDAR points.

Task 1.2: Nerfstudio Setup

You need a working Nerfstudio environment with GPU support. Choose one of the following options:

Option A: Docker (recommended)

The Nerfstudio project publishes an official Docker image on GitHub Container Registry that includes Nerfstudio, COLMAP, PyTorch, and CUDA pre-configured.
docker run --gpus all \
  -u $(id -u) \
  -v $(pwd)/data/captures:/data/captures \
  -v $(pwd)/data/ns_outputs:/data/ns_outputs \
  -v $HOME/.cache:/home/user/.cache \
  -p 7007:7007 \
  --rm -it \
  --shm-size=12gb \
  ghcr.io/nerfstudio-project/nerfstudio:latest
Specific version tags are also available (e.g., ghcr.io/nerfstudio-project/nerfstudio:1.1.3). See the GHCR package page for all available tags. If you prefer to build the image yourself (e.g., to target a specific CUDA architecture):
git clone https://github.com/nerfstudio-project/nerfstudio.git
cd nerfstudio
docker build \
  --build-arg CUDA_ARCHITECTURES=86 \
  --tag nerfstudio-local \
  --file Dockerfile .
See the official Dockerfile for available build arguments.

Option B: Native pip install

If you prefer a local installation without Docker:
# Create a conda environment
conda create -n nerfstudio python=3.10 -y
conda activate nerfstudio

# Install PyTorch with CUDA (check https://pytorch.org for your CUDA version)
pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118

# Install Nerfstudio (includes gsplat)
pip install nerfstudio

# Verify
ns-train --help
See the Nerfstudio installation guide for full details, including COLMAP installation for your platform.

Task 1.3: Train Your First Splat

Inside your Nerfstudio environment (Docker or native), train splatfacto on your processed data:
ns-train splatfacto \
  --data /data/captures/my_room_processed \
  --output-dir /data/ns_outputs

# Open localhost:7007 in browser to watch training live
Refer to the Nerfstudio custom data guide and the step-by-step walkthrough for detailed instructions.

Task 1.4: Evaluate and Experiment

Once your first splat is trained, run evaluation and begin experimenting:
  1. Evaluate your trained splat using Nerfstudio’s built-in metrics:
    ns-eval --load-config /data/ns_outputs/<run>/config.yml \
      --output-path /data/ns_outputs/<run>/eval_results.json
    
    Report PSNR, SSIM, and LPIPS on the held-out test views.
  2. Export the trained splat for external viewing:
    ns-export gaussian-splat \
      --load-config /data/ns_outputs/<run>/config.yml \
      --output-dir /data/ns_outputs/<run>/export
    
  3. Experiment with at least two of the following variations and compare results:
Experiment | What to vary | What to measure | Notes
Training duration | 7K vs 15K vs 30K iterations | PSNR/SSIM, training time, visual quality | All paths
Method variant | splatfacto vs splatfacto-big | Quality metrics, VRAM usage, training time | All paths
Depth supervision | With vs without depth_file_path in transforms.json | Geometric accuracy, floaters | Path A/B only (requires depth)
Point cloud init | With vs without --ply during processing | Convergence speed, final quality | Path A only (requires LiDAR PLY)
Quality tuning | Default vs tuned (--pipeline.model.cull_alpha_thresh=0.005 --pipeline.model.continue_cull_post_densification=False --pipeline.model.use_scale_regularization=True) | Floaters, edge quality | All paths
Capture density | Vary frame extraction rate (e.g., 1 FPS vs 3 FPS vs 5 FPS) | Quality vs training time | Path C (video extraction)
  4. Capture a second scene — a different room or area — and train a splat. You will use both scenes to discuss coverage vs. quality tradeoffs in your report.

Deliverables for Part 1

  • Screenshots or screen recordings of the Nerfstudio viewer showing your trained splats
  • Evaluation metrics (PSNR, SSIM, LPIPS) for each experiment variation
  • Written analysis: what capture choices and training configurations affected quality and why

Part 2: Gaussian Splatting in the ROS Pipeline

In Part 1 you captured real scenes with a physical camera. Now you will integrate Gaussian Splatting into the TurtleBot3 simulation pipeline, where RGB-D images and poses come from Gazebo via ROS 2 and Zenoh. The resulting splat will be a faithful 3D reconstruction of the Gazebo house world — it will look as realistic (or as synthetic) as the Gazebo renderer itself. The value is not photorealism but the representation: a dense, renderable, queryable 3D map that an occupancy grid or sparse point cloud cannot provide. Gazebo’s deterministic renderer and ground-truth odometry give you ideal conditions for 3DGS (perfect multi-view consistency, zero pose drift), letting you focus on the capture pipeline and exploration strategy rather than sensor noise.

Architecture Overview

The capture pipeline follows the same Zenoh subscriber pattern as the existing object_detector.py:
Gazebo House World (D435i sim)
    |
    | ros_gz_bridge → zenoh-bridge-ros2dds
    v
gs_capture.py (Zenoh subscriber)
    |
    | Keyframe gating (distance + angle)
    | ROS → Nerfstudio coordinate transform
    | Depth m → mm conversion
    v
data/captures/house_run/
    ├── transforms.json      (intrinsics + per-frame extrinsics)
    ├── images/              (RGB PNGs)
    └── depth/               (16-bit depth PNGs in mm)
    |
    v
ns-train splatfacto --data <path>
    |
    v
Trained Gaussian Splat of the House World

Task 2.1: Implement gs_capture.py

Write a Zenoh capture script (detector/gs_capture.py) that subscribes to the robot’s RGB, depth, and odometry streams and writes keyframes in Nerfstudio format. Data sources (Zenoh keys):
Zenoh Key | ROS Topic | Data | Frame
camera/color/image_raw | /camera/color/image_raw | RGB image (sensor_msgs/Image, CDR) | camera_optical_frame
camera/depth/image_rect_raw | /camera/depth/image_rect_raw | Depth image (float32 meters, CDR) | camera_depth_frame
odom | /odom | Robot odometry (nav_msgs/Odometry, CDR) | odom → base_link
Key design decisions you must implement:
  1. Frame synchronization: Capture is RGB-triggered. When an RGB frame arrives, pair it with the most recent depth and odom. Drop the frame if depth or odom is more than 200 ms stale.
  2. Keyframe gating: Same logic as object_detector.py — only capture when the robot has moved more than a distance threshold or rotated more than an angle threshold since the last keyframe. Expose these as CLI arguments (defaults: 0.3 m, 10 degrees).
  3. Coordinate transform: This is the most critical part. Nerfstudio’s transform_matrix is camera-to-world in OpenGL convention (x-right, y-up, z-backward). ROS uses x-forward, y-left, z-up for the robot body. The transform chain is:
    T_nerfstudio = T_odom_baselink @ T_baselink_camera @ T_ros_optical_to_nerfstudio
    
    Where:
    • T_odom_baselink — from the /odom message (position + quaternion → 4x4 matrix)
    • T_baselink_camera — static offset from URDF: translation [0.064, -0.065, 0.094], no rotation
    • T_ros_optical_to_nerfstudio — 180-degree rotation around x-axis (flips y and z)
    Study the worked numeric example in the design spec to verify your implementation. An incorrect transform produces mirrored or inverted reconstructions. A minimal sketch of this chain is given at the end of this task.
  4. Depth conversion: Gazebo produces float32 meters. Nerfstudio expects 16-bit PNG in millimeters:
    depth_mm = (depth_m * 1000.0).clip(0, 65535).astype(np.uint16)
    
  5. Missing odom handling: If no odom has been received, capture pauses. Frames without valid poses are never written. Log a warning if odom is missing for more than 10 seconds.
  6. Output format: Write transforms.json with camera intrinsics (from the simulated D435i: 320x240, fx=fy=277.13, cx=160, cy=120) and per-frame entries with file_path, depth_file_path, and transform_matrix.
Reference: Study object_detector.py for the Zenoh subscription pattern, CDR deserialization, and keyframe gating logic. Your capture script follows the same structure but writes a different output format.
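The sketch below illustrates the transform chain from step 3 and the per-frame transforms.json entry from step 6. Helper names are illustrative, and the static offset and axis flip simply reproduce the values stated above — verify the result against the worked numeric example in the design spec before trusting it:

# gs_capture transform sketch — composes T_nerfstudio as described in step 3 above.
import json
import numpy as np

def quat_to_matrix(x: float, y: float, z: float, w: float) -> np.ndarray:
    """Convert a ROS quaternion (x, y, z, w) to a 3x3 rotation matrix."""
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def pose_to_T(position, quaternion) -> np.ndarray:
    """Build a 4x4 homogeneous transform from an odometry position + quaternion."""
    T = np.eye(4)
    T[:3, :3] = quat_to_matrix(*quaternion)
    T[:3, 3] = position
    return T

# Static base_link -> camera offset from the URDF (translation only, per the task description).
T_baselink_camera = np.eye(4)
T_baselink_camera[:3, 3] = [0.064, -0.065, 0.094]

# 180-degree rotation about x (flips y and z), converting from the ROS optical convention
# to Nerfstudio's OpenGL camera convention (x-right, y-up, z-backward).
T_ros_optical_to_nerfstudio = np.diag([1.0, -1.0, -1.0, 1.0])

def nerfstudio_camera_to_world(odom_position, odom_quaternion) -> np.ndarray:
    """T_nerfstudio = T_odom_baselink @ T_baselink_camera @ T_ros_optical_to_nerfstudio."""
    T_odom_baselink = pose_to_T(odom_position, odom_quaternion)
    return T_odom_baselink @ T_baselink_camera @ T_ros_optical_to_nerfstudio

# Example per-frame entry; key names follow the Nerfstudio transforms.json convention.
frame_entry = {
    "file_path": "images/frame_00042.png",
    "depth_file_path": "depth/frame_00042.png",
    "transform_matrix": nerfstudio_camera_to_world(
        odom_position=[1.2, -0.4, 0.0],
        odom_quaternion=[0.0, 0.0, 0.0, 1.0],   # x, y, z, w
    ).tolist(),
}
print(json.dumps(frame_entry, indent=2))

A unit test for Task 2.5's deliverable can call nerfstudio_camera_to_world with the pose from the design spec's worked example and assert the expected matrix.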

Task 2.2: Capture and Train from Simulation

Use the capture script to collect data from the Gazebo house world:
# 1. Launch the simulation stack
docker compose up -d demo-world-house zenoh-router zenoh-bridge gs-capture

# 2. Drive the robot through the house
#    Use teleop or Nav2 waypoints to explore the environment
#    Aim for systematic coverage — visit every room

# 3. Stop capture
docker compose stop gs-capture

# 4. Train (inside nerfstudio container)
docker exec -it turtlebot-maze-nerfstudio-1 bash
ns-train splatfacto \
  --data /data/captures/house_run \
  --output-dir /data/ns_outputs

# 5. View the trained splat
ns-viewer --load-config /data/ns_outputs/house_run/splatfacto/<timestamp>/config.yml

Task 2.3: Exploration Strategy Experiments

The quality of a Gaussian Splat depends heavily on how well the scene was covered during capture. Run at least two of these experiments:
Experiment | Description
Teleop vs Nav2 waypoints | Compare manual exploration with pre-planned waypoint sequences. Which gives better coverage?
Dense vs sparse capture | Compare --keyframe-dist 0.1 --keyframe-angle 5 with --keyframe-dist 1.0 --keyframe-angle 30. How does frame count affect quality and training time?
Partial vs full coverage | Train a splat from a single room capture vs. the full house. Where do reconstruction holes appear?
Depth supervision | Remove depth_file_path entries from transforms.json and retrain. Compare geometric accuracy and floater artifacts.
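For the depth-supervision experiment, a small sketch that writes a copy of transforms.json with the depth_file_path entries removed. The path is an example, and it assumes the per-frame entries live under a top-level frames list, as in the Nerfstudio convention:

# strip_depth.py — write a copy of transforms.json without depth supervision entries.
import json
from pathlib import Path

src = Path("/data/captures/house_run/transforms.json")   # example path — adjust to your capture
dst = src.with_name("transforms_no_depth.json")

meta = json.loads(src.read_text())
for frame in meta.get("frames", []):
    frame.pop("depth_file_path", None)   # drop the per-frame depth reference if present
dst.write_text(json.dumps(meta, indent=2))
print(f"Wrote {dst} with {len(meta.get('frames', []))} frames (depth supervision removed)")

To retrain without depth, back up the original transforms.json and swap in the stripped copy before running ns-train.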

Task 2.4: Sim-to-Real Comparison

Compare your simulation splat (Part 2) with your real-world splat (Part 1). This comparison is designed to make the sim-real gap concrete and measurable.
  1. Render 5 novel views from random poses in each trained splat.
  2. Qualitatively compare: texture detail, geometric accuracy, lighting, artifacts.
  3. Discuss:
    • What are the fundamental differences between simulated and real-world captures?
    • Where does the Gazebo splat faithfully reproduce geometry but fail on visual realism?
    • What would need to change in the Gazebo world (materials, lighting, sensor noise) to narrow the gap?
    • How does the sim-real gap affect downstream tasks like semantic queries (Part 3)?

Deliverables for Part 2

  • Source code for gs_capture.py with inline comments explaining the coordinate transform
  • A unit test that verifies the coordinate transform: given a known ROS pose, assert the expected Nerfstudio matrix
  • Evaluation metrics and screenshots for each exploration experiment
  • Written sim-to-real comparison analysis

Part 3: Semantic Navigation Agent with Language-Embedded Splats

From Closed Vocabulary to Open-World Semantics

In previous assignments, the robot’s object detection pipeline used COCO-trained models (YOLO, Faster R-CNN) that recognize a fixed set of 80 object classes. If “refrigerator” happens to be one of those 80 classes, the detector finds it. If you ask for “the red mug on the counter” or “the fire extinguisher” — objects outside the training vocabulary — the detector is blind.

Language-embedded Gaussian Splatting changes this fundamentally. By embedding CLIP features into each Gaussian, the 3D map becomes queryable with arbitrary natural language — not just a fixed class list. CLIP was trained on 400 million image-text pairs from the internet, so it encodes a broad understanding of visual concepts. A language-embedded splat can answer queries like “wooden bookshelf,” “potted plant near the window,” or “the area that looks like a kitchen” — none of which would match a COCO class label.

This is the shift from closed-vocabulary detection (what the model was trained on) to open-vocabulary scene understanding (what language can describe). Combined with an LLM agent, it enables navigation commands in natural language rather than explicit coordinates.

Background: Language-Embedded Gaussian Splatting

Several recent works embed CLIP or DINO features into each Gaussian, creating a 3D scene representation that is both visually renderable and semantically queryable:
Method | Key Idea | Speed | Reference
LEGS (Berkeley) | Built on Nerfstudio Splatfacto; runs on mobile robot; 66% open-vocab accuracy | 3.5x faster training than LERF | IROS 2024
LangSplat | Scene-wise language autoencoder; CLIP features in latent space per Gaussian | 199x faster than LERF | CVPR 2024
LangSplatV2 | Global 3D codebook for high-dim features | 450+ FPS rendering | NeurIPS 2025
LEGaussians | Quantized CLIP + DINO features as discrete indices per Gaussian | Real-time rendering | CVPR 2024
Feature 3DGS | Learns arbitrary feature fields (CLIP, SAM, etc.) alongside color | Real-time rendering | CVPR 2024
Recommendation: Start with LEGS as it is built directly on Nerfstudio’s Splatfacto (which you already used in Parts 1-2) and is designed for mobile robot scenarios. If you want higher semantic quality for offline analysis, explore LangSplat.

Task 3.1: Train a Language-Embedded Splat

Using your house world capture from Part 2, train a language-embedded Gaussian Splat.
  1. Study the LEGS pipeline: Read the LEGS paper and its Nerfstudio integration. LEGS extends Splatfacto by adding a per-Gaussian language feature vector trained alongside the visual features. At query time, you render the language features from any viewpoint and compare with a text query’s CLIP embedding.
  2. Generate CLIP annotations: Before training, you need to generate per-image CLIP feature maps from your captured RGB images. LEGS provides scripts for this. The annotations are used as supervision during training alongside the photometric loss.
  3. Train the language-embedded splat:
    # Exact command depends on LEGS version — consult the LEGS README
    # The general pattern:
    ns-train legs-splatfacto \
      --data /data/captures/house_run \
      --output-dir /data/ns_outputs
    
  4. Test open-vocabulary queries: Once trained, query the splat with natural language. Start with queries that would work with a COCO detector, then go beyond: COCO-equivalent queries (baseline):
    • “refrigerator”
    • “couch”
    • “dining table”
    Open-vocabulary queries (the new capability):
    • “kitchen counter”
    • “the doorway to the bedroom”
    • “something to sit on”
    • “area with appliances”
    For each query, the system should return a relevancy heatmap rendered from one or more viewpoints, highlighting which Gaussians match the text query. Record the top-1 location (3D centroid of the highest-relevance region) for each query. Compare the open-vocabulary results with what a COCO detector could have found — this is the concrete evidence of the capability shift.
    Note: CLIP features are learned from image appearance. On Gazebo-rendered scenes, semantic accuracy will depend on how recognizable the simulated objects are to CLIP (which was trained on real-world images). Objects with distinctive shapes (refrigerator, couch) may work better than those that rely on realistic textures.

Task 3.2: Build the Semantic Query Interface

Build a Python module that takes a trained language-embedded splat and a text query and returns:
  1. Object location — the 3D position (x, y, z) of the queried object in the world frame
  2. Relevancy map — a rendered image showing where the query matches in the scene
  3. Confidence score — cosine similarity between the query embedding and the best-matching region
The interface should support queries like:
from gs_semantic_query import SemanticSplatQuery

query_engine = SemanticSplatQuery(
    model_config="/data/ns_outputs/house_run/legs-splatfacto/<timestamp>/config.yml"
)

result = query_engine.query("refrigerator")
# result.position  → np.array([x, y, z]) in odom frame
# result.relevancy_image → np.ndarray (H, W, 3)
# result.confidence → float (0-1)
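Internally, a query reduces to comparing a CLIP text embedding with language features rendered from the splat. A minimal sketch of that comparison is shown below; feature_map and depth are assumed to be rendered from the language-embedded splat at a chosen camera pose (how you obtain them depends on the method, e.g., LEGS), the CLIP calls use OpenAI's clip package, and the rendered features are assumed to live in (or be decoded back to) CLIP space:

# Sketch of the relevancy computation behind SemanticSplatQuery.query().
import clip
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def embed_text(query: str) -> np.ndarray:
    # CLIP text embedding, L2-normalized so a dot product equals cosine similarity.
    with torch.no_grad():
        tokens = clip.tokenize([query]).to(device)
        emb = clip_model.encode_text(tokens).float()
        emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb.cpu().numpy()[0]                      # (D,)

def relevancy_map(feature_map: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    # Cosine similarity between every rendered pixel feature (H, W, D) and the query embedding.
    feats = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    return feats @ text_emb                          # (H, W)

def localize(query: str, feature_map, depth, c2w, fx, fy, cx, cy):
    # Pick the most relevant pixel and back-project it to a 3D point in the world frame.
    rel = relevancy_map(feature_map, embed_text(query))
    v, u = np.unravel_index(np.argmax(rel), rel.shape)
    z = float(depth[v, u])
    # Pinhole back-projection (a z-forward convention is assumed for the rendered depth).
    p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
    position_world = (np.asarray(c2w) @ p_cam)[:3]   # c2w: 4x4 camera-to-world matrix
    return position_world, rel, float(rel[v, u])     # position, relevancy image, confidence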

Task 3.3: LLM-Powered Navigation Agent

Build an agent that accepts natural language navigation commands and translates them into robot trajectories using the semantic splat as a world model. Architecture:
User prompt
  "show me all the top trajectories that reach the refrigerator location"
    |
    v
LLM Agent (Claude / GPT-4)
    |
    | 1. Parse intent → target object = "refrigerator"
    | 2. Query semantic splat → get 3D location
    | 3. Get current robot pose from /odom
    | 4. Query Nav2 planner for candidate paths
    | 5. Rank trajectories by criteria
    | 6. Render expected views along top trajectories from splat
    v
Response with trajectory visualizations
The agent should support at minimum these types of queries:
Query Type | Example | Expected Behavior
Navigate to object | “go to the refrigerator” | Resolve object location via semantic splat, plan path via Nav2, execute
Show trajectories | “show me all the top trajectories that reach the refrigerator location” | Resolve location, compute multiple candidate paths, render expected views along each path from the splat, return ranked results
Scene query | “what objects are near the dining table?” | Query multiple object types, compute spatial relationships, return natural language description
Exploration | “explore the area around the kitchen” | Resolve “kitchen” region, generate waypoints for coverage, optionally extend the splat with new captures
Implementation guidance:
  1. LLM tool interface: Define the agent’s tools as functions the LLM can call (a sketch of such declarations is given after this list):
    • query_object_location(object_name: str) → {position, confidence}
    • get_robot_pose() → {x, y, z, yaw}
    • plan_path(start, goal) → list[Pose]
    • render_view_from_splat(pose) → Image
    • send_nav2_goal(x, y, yaw) → status
  2. Trajectory ranking: For the “show me top trajectories” query, consider ranking by:
    • Path length (shorter is better)
    • Clearance from obstacles (safer paths)
    • Visual coverage (paths that pass through well-reconstructed areas of the splat, where the robot can verify its position)
    • Semantic relevance (paths that pass by related objects)
  3. Splat-based visualization: For each candidate trajectory, render what the robot would see at key waypoints along the path using the trained splat. This gives the user a preview of the journey before committing to execution.
  4. ROS integration: The agent communicates with the robot through the existing ROS 2 / Zenoh infrastructure:
    • Read current pose from /odom
    • Send goals via Nav2’s NavigateToPose action
    • Monitor progress via Nav2 feedback
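A sketch of how the tools from step 1 might be declared for a tool-calling LLM. The JSON-schema style below follows the common function-calling convention; the exact wire format differs between providers (e.g., Anthropic uses input_schema, OpenAI uses parameters), so adapt it to whichever API you use:

# Sketch of tool declarations for a function-calling LLM (provider-agnostic JSON-schema style).
TOOLS = [
    {
        "name": "query_object_location",
        "description": "Resolve a natural-language object name to a 3D position using the language-embedded splat.",
        "input_schema": {
            "type": "object",
            "properties": {"object_name": {"type": "string"}},
            "required": ["object_name"],
        },
    },
    {
        "name": "get_robot_pose",
        "description": "Return the robot's current pose (x, y, z, yaw) from /odom.",
        "input_schema": {"type": "object", "properties": {}},
    },
    {
        "name": "plan_path",
        "description": "Ask Nav2 for a path from start to goal; returns a list of poses.",
        "input_schema": {
            "type": "object",
            "properties": {
                "start": {"type": "array", "items": {"type": "number"}},
                "goal": {"type": "array", "items": {"type": "number"}},
            },
            "required": ["start", "goal"],
        },
    },
    {
        "name": "render_view_from_splat",
        "description": "Render the expected camera view at a given pose from the trained splat.",
        "input_schema": {
            "type": "object",
            "properties": {"pose": {"type": "array", "items": {"type": "number"}}},
            "required": ["pose"],
        },
    },
    {
        "name": "send_nav2_goal",
        "description": "Send a NavigateToPose goal (x, y, yaw) and return the action status.",
        "input_schema": {
            "type": "object",
            "properties": {"x": {"type": "number"}, "y": {"type": "number"}, "yaw": {"type": "number"}},
            "required": ["x", "y", "yaw"],
        },
    },
]

# Dispatch table mapping tool names to your Python implementations (to be filled in), e.g.:
# TOOL_IMPLS = {"query_object_location": query_engine.query, ...}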
Research pointers for implementation:
  • Splat-Loc (Stanford) — render-and-compare re-localization at ~25 Hz against a pre-built GS map. Relevant for verifying the robot’s position against the splat during trajectory execution.
  • GS-Loc (RAL 2025) — vision foundation model-driven re-localization using Gaussian Splats.
  • HAMMER (arxiv.org/abs/2501.14147) — multi-robot collaborative semantic GS with ROS communication. Shows how to architect the ROS-to-splat interface.
  • ROSplat (github.com/shadygm/ROSplat) — ROS 2 Jazzy package for online GS visualization with custom GaussianArray messages.

Task 3.4: Evaluation

Evaluate your semantic navigation agent:
  1. Object localization accuracy: For 5 objects in the house world, compare the semantic splat’s estimated position with the ground-truth position from the Gazebo world file. Report the position error in meters.
  2. Open-vocabulary vs. closed-vocabulary: For 10 queries (5 within COCO vocabulary, 5 outside it), compare:
    • Does the COCO detector find the object? (Yes/No)
    • Does the semantic splat find the object? (Yes/No, with confidence score)
    • Position accuracy for objects both systems can detect
  3. Query success rate: Issue 10 natural language navigation queries of varying complexity. For each query, record:
    • Did the agent correctly parse the intent?
    • Did the semantic splat return a valid location?
    • Did the planned trajectory successfully reach the goal?
    • Were the rendered splat views useful for previewing the trajectory?
  4. Trajectory quality: For the “show me top trajectories” query, compare the top-3 trajectories by path length, estimated clearance, and visual quality of the rendered waypoint views.

Deliverables for Part 3

  • Source code for the semantic query interface (gs_semantic_query.py)
  • Source code for the LLM navigation agent with tool definitions
  • Demo video showing at least 3 different semantic navigation queries being executed
  • Evaluation results: object localization accuracy, open-vocab vs closed-vocab comparison, query success rate, trajectory comparison
  • Written analysis: strengths and limitations of using a language-embedded splat as a world model for navigation

References

Foundational

  • 3D Gaussian Splatting for Real-Time Radiance Field Rendering (SIGGRAPH 2023) — the original paper
  • 3DGS in Robotics: A Survey — applications in SLAM, navigation, and manipulation

GS-SLAM Systems

  • RTG-SLAM — real-time RGB-D Gaussian SLAM (SIGGRAPH 2024)
  • SplaTAM — best reconstruction quality for RGB-D (CVPR 2024)
  • Photo-SLAM — runs on Jetson AGX Orin (CVPR 2024)

Semantic / Language-Embedded Splats

  • LEGS — language-embedded Gaussian Splatting for mobile robots (IROS 2024)
  • LangSplat — CLIP features per Gaussian (CVPR 2024 Highlight)
  • Feature 3DGS — arbitrary feature fields alongside color (CVPR 2024)
  • HAMMER — multi-robot collaborative semantic GS with ROS (RAL 2025)

Re-localization

  • Splat-Loc — render-and-compare re-localization at ~25 Hz
  • GS-Loc — vision foundation model-driven (RAL 2025)

Frameworks and Tools

  • Nerfstudio — full training pipeline, Splatfacto method
  • gsplat — optimized PyTorch library (4x less memory)
  • COLMAP — Structure-from-Motion for camera pose estimation from unposed images (install guide, tutorial)
  • ROSplat — ROS 2 Jazzy GS visualization
  • Record3D — iPhone LiDAR capture app
  • Polycam — alternative mobile capture
  • Scaniverse — cross-platform mobile capture (iPhone and Android)

Summary of Deliverables

Part | Deliverables
Part 1 | Trained splat of your room, evaluation metrics, experiment comparisons, written analysis
Part 2 | gs_capture.py source + unit test, sim splat with exploration experiments, sim-to-real comparison
Part 3 | Semantic query module, LLM navigation agent, demo video, evaluation results, limitations analysis
All Nerfstudio outputs should go to /data/ns_outputs (mounted from ./data/ns_outputs on the host). Always use --output-dir /data/ns_outputs to keep artifacts in the mounted volume.