This part is optional. It builds on the work from Parts 1-2 and is intended for students who want to explore open-vocabulary scene understanding and LLM-powered navigation on top of their trained Gaussian Splats.
Introduction
In Assignment 1 you trained a standard 3D Gaussian Splat using gsplat’s simple_trainer on both a real-world scene (your room) and a simulated scene (the Gazebo house world). That splat captures geometry and appearance: it can render novel views and represent free space, but it has no understanding of what is in the scene. A refrigerator and a bookshelf are just collections of colored Gaussians.
In this assignment you will extend your trained splat with language understanding, so that every Gaussian carries not only position, color, and opacity but also a semantic feature vector. The 3D map becomes queryable with arbitrary natural language, a capability that fundamentally changes what a robot can do with its world model.
From closed vocabulary to open-world semantics
Traditional object detection in robotics relies on models trained on fixed class sets. A COCO-trained YOLO detector recognizes 80 object classes. If “refrigerator” is one of those 80 classes, the detector finds it. If you ask for “the red mug on the counter,” “the fire extinguisher,” or “the area that looks like a kitchen” (concepts outside the training vocabulary), the detector is blind.
CLIP (Contrastive Language-Image Pre-training) changes this. Trained on 400 million image-text pairs from the internet, CLIP encodes broad visual-semantic understanding into a shared embedding space where images and text are directly comparable. A CLIP feature vector for an image patch can be compared with the CLIP encoding of any text string using cosine similarity; no fixed class list is required.
Language-embedded Gaussian Splatting takes this one step further: each Gaussian in the 3D scene gets an additional learnable feature vector (a CLIP embedding) alongside its geometric and appearance attributes. During training, these features are supervised by dense CLIP feature maps extracted from the training images. At query time, you render the per-Gaussian features from any viewpoint and compare them with a text query’s CLIP embedding. The result is a heatmap showing where in the scene the queried concept appears. This enables queries no fixed-class detector can handle:
- “where is the refrigerator?” (COCO-equivalent)
- “the area that looks like a kitchen” (spatial/semantic region)
- “something to sit on” (functional description)
- “the doorway to the bedroom” (relational/contextual)
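
The comparison step described above can be sketched in a few lines. This is a minimal illustration, not the code of any particular method: the function name relevancy_map is made up for this example, and random arrays stand in for a rendered feature map and a CLIP text embedding.

```python
import numpy as np

def relevancy_map(feature_map: np.ndarray, text_embedding: np.ndarray) -> np.ndarray:
    """Cosine similarity between each pixel's feature and one text embedding.

    feature_map: (H, W, D) rendered features; text_embedding: (D,).
    Returns an (H, W) array with values in [-1, 1].
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    feats = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + eps)
    text = text_embedding / (np.linalg.norm(text_embedding) + eps)
    return feats @ text

# Toy data standing in for a rendered feature map and a CLIP text embedding.
rng = np.random.default_rng(0)
D = 8
text = rng.normal(size=D)
features = rng.normal(size=(4, 6, D))
features[1, 2] = 3.0 * text  # plant one pixel that points in the text direction

heat = relevancy_map(features, text)
peak = np.unravel_index(np.argmax(heat), heat.shape)  # (row, col) of the best match
```

Because both sides are normalized, the scale of the planted feature does not matter; only its direction does, which is why cosine similarity is the usual choice for comparing CLIP embeddings.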
Key insight: gsplat already supports feature rendering
gsplat’s rasterizer is not limited to rendering RGB. The rasterization function accepts per-Gaussian colors of shape [N, D], where D can be any channel count: 3 for RGB, 512 for CLIP embeddings, or any compressed dimension in between. This is the foundation both implementation paths in this assignment build on. The rasterizer alpha-composites whatever per-Gaussian vectors you provide, just as it does for color.
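
To see why the channel count does not matter, here is a simplified single-pixel model of front-to-back alpha compositing. It is a sketch for intuition only, not gsplat’s implementation: the function name composite is made up, and the alphas are assumed to be the already-evaluated opacities of the Gaussians covering the pixel, sorted near to far.

```python
import numpy as np

def composite(alphas: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Front-to-back alpha compositing of per-Gaussian vectors for one pixel.

    alphas: (K,) effective opacity of each Gaussian at this pixel, sorted near to far.
    values: (K, D) per-Gaussian vectors (RGB when D=3, features otherwise).
    """
    # Transmittance before Gaussian i is the product of (1 - alpha_j) for all j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ values

alphas = np.array([0.5, 0.5, 1.0])
rgb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
feat = np.random.default_rng(0).normal(size=(3, 64))

pixel_rgb = composite(alphas, rgb)    # blending weights are [0.5, 0.25, 0.25]
pixel_feat = composite(alphas, feat)  # same weights, applied to 64 channels
```

The blending weights depend only on opacity and depth order, so the same weights apply to a 3-channel color and a 64-channel feature.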
Recent methods
| Method | Key Idea | Speed | Reference |
|---|---|---|---|
| Feature-3DGS | N-dimensional rasterizer for arbitrary foundation model features | Real-time | CVPR 2024 |
| LangSplat | Scene-wise language autoencoder, CLIP features in latent space | 199x faster than LERF | CVPR 2024 |
| LangSplatV2 | Global 3D codebook for high-dim features | 450+ FPS | NeurIPS 2025 |
| LEGaussians | Quantized CLIP + DINO per Gaussian | Real-time | CVPR 2024 / TPAMI 2025 |
| OpenGaussian | Point-level open-vocabulary understanding | Real-time | NeurIPS 2024 |
Prerequisites
- Completed Assignment 1 (trained gsplat model of your room and the Gazebo house world, with COLMAP data)
- Familiarity with CLIP (OpenAI CLIP paper, HuggingFace transformers CLIP)
- GPU with 16+ GB VRAM (language features increase memory requirements)
- Python 3.10+, PyTorch 2.1+, CUDA 11.8+
Choose Your Implementation Path
This assignment offers two implementation paths. Both use the same COLMAP-format data from Assignment 1 and produce the same deliverables. Choose the path that matches your goals.
Path A: Feature-3DGS (Recommended for most students)
Use the Feature-3DGS codebase, which provides a complete pipeline for embedding foundation model features into Gaussian Splats.
- What you get: Feature extraction scripts, an N-dimensional Gaussian rasterizer, training pipeline, and query utilities, all working out of the box with COLMAP data
- Architecture: Extends the standard 3DGS rasterizer to render N-dimensional feature vectors alongside RGB. Distills 2D foundation model features (SAM, CLIP-LSeg) into per-Gaussian feature attributes
- Dependencies: PyTorch 2.4+, CUDA 11.8+
- Tradeoff: You use an existing, well-tested codebase, which is a faster path to results but gives less low-level understanding of the rendering internals
Path B: Build from Scratch on gsplat (Advanced)
Build the entire language embedding pipeline yourself on top of gsplat.
- What you build: CLIP feature extraction from training images, per-Gaussian feature training with a modified simple_trainer, text query interface
- Foundation: gsplat’s rasterization already supports arbitrary-dimension per-Gaussian features; you implement everything above that primitive
- Tradeoff: Significantly more work, but you gain deep understanding of how language-embedded splats work and full flexibility to experiment
Path A: Feature-3DGS
Task A.1: Setup
Clone and install the Feature-3DGS codebase, following the setup instructions in its README. At a high level, Feature-3DGS:
- Extracts dense 2D feature maps from each training image using foundation models (SAM, CLIP-LSeg)
- Assigns each Gaussian a learnable feature vector of dimension D
- Renders these feature vectors using the N-dimensional rasterizer
- Supervises with a feature distillation loss: MSE between rendered features and the ground-truth 2D feature maps
- Total loss = photometric RGB loss + feature distillation loss
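
The loss structure in the list above can be sketched as follows. This is an illustration under stated assumptions, not the Feature-3DGS code: a plain L1 term stands in for the photometric RGB loss, and the feat_weight parameter is a hypothetical knob added for this example (the text above describes an unweighted sum).

```python
import numpy as np

def l1_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean(np.abs(pred - target)))

def mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.mean((pred - target) ** 2))

def total_loss(rgb_pred, rgb_gt, feat_pred, feat_gt, feat_weight=1.0):
    """Photometric term plus weighted feature distillation term."""
    photometric = l1_loss(rgb_pred, rgb_gt)       # stand-in for the RGB loss
    distillation = mse_loss(feat_pred, feat_gt)   # rendered features vs. 2D feature map
    return photometric + feat_weight * distillation, photometric, distillation

# Toy tensors: a 4x4 image with 3 color channels and 16 feature channels.
H, W, D = 4, 4, 16
rgb_gt = np.full((H, W, 3), 0.5)
rgb_pred = rgb_gt + 0.1
feat_gt = np.zeros((H, W, D))
feat_pred = np.full((H, W, D), 0.2)

loss, photo, distill = total_loss(rgb_pred, rgb_gt, feat_pred, feat_gt)
```

Logging the two terms separately, as the function returns them here, makes it easier to tell whether a change in the total comes from the RGB side or the feature side.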
Task A.2: Extract Foundation Model Features
Before training, you need per-image dense feature maps as supervision targets. Feature-3DGS supports multiple foundation models:
- SAM features: Segment Anything masks for each training image, providing instance-level boundaries
- CLIP-LSeg features: per-pixel language-grounded features from LSeg (Language-driven Semantic Segmentation)
For this assignment, extract CLIP-based features. The extraction step processes each training image through the foundation models and saves dense feature maps (typically as .npy or .pt files) that will be loaded during training.
Note: Exact script names and arguments depend on the Feature-3DGS version; consult their README for the current API. The data directory should point to your COLMAP output from Assignment 1 (containing sparse/0/cameras.bin, images.bin, points3D.bin, and the images/ folder).
Task A.3: Train Feature-Embedded Gaussians
Train the language-embedded Gaussian Splat using your COLMAP data and the extracted features. During training, monitor:
- RGB PSNR: should be comparable to your Assignment 1 splat (the feature head should not degrade visual quality)
- Feature loss: should decrease steadily, indicating the per-Gaussian features are learning to reproduce the 2D CLIP maps
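
If you want to check the RGB PSNR yourself rather than rely on the trainer’s logs, the standard definition is short. The sketch below assumes images stored as floating-point arrays in [0, 1]; the function name psnr and the toy images are made up for this example.

```python
import numpy as np

def psnr(rendered: np.ndarray, reference: np.ndarray, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    diff = rendered.astype(np.float64) - reference.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example: a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01.
reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)
value = psnr(rendered, reference)
```

Computing this on the same held-out views for the Assignment 1 splat and the feature-embedded splat gives a like-for-like comparison.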
Task A.4: Query with Text
Once trained, you can render semantic feature maps from arbitrary viewpoints and compare them with text queries.
Path B: Build from Scratch on gsplat
This is the advanced path. You will build the entire language embedding pipeline on top of gsplat’s rasterization primitive.
Task B.1: Understand gsplat’s Feature Rendering
gsplat’s rasterization function accepts per-Gaussian colors of arbitrary dimension D. Normally D=3 (RGB), but you can pass D=512 (full CLIP embedding), D=64 (compressed), or any other dimension. The rasterizer alpha-composites whatever vectors you provide, producing a (H, W, D) output image.
This is the key API: the same rasterization call serves both the RGB pass and the feature pass, and only the colors argument changes. This means you can share the Gaussian geometry from your Assignment 1 model and simply add a new learnable feature attribute.
Task B.2: Extract CLIP Features from Training Images
For each training image, you need dense per-pixel CLIP features as supervision targets.
- For each training image in your COLMAP dataset, extract dense CLIP features
- Save the feature maps to disk (they are large, 768 floats per pixel)
- Optional but recommended: Compress features with PCA or a small autoencoder to reduce dimensionality from 768 to 64 or 128. This dramatically reduces memory (768 floats per Gaussian vs. 64) and speeds up rasterization. LangSplat uses a scene-wise autoencoder for this purpose
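
The PCA option can be sketched with a plain SVD. This is a minimal example under stated assumptions: the function names are made up, and the toy data uses 32 dimensions compressed to 4 so it runs instantly, where the assignment would go from 768 to 64 or 128. It is not the LangSplat autoencoder.

```python
import numpy as np

def fit_pca(features: np.ndarray, n_components: int):
    """features: (M, D) sampled feature vectors. Returns (mean, components)."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def compress(features: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    return (features - mean) @ components.T

def decompress(codes: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    return codes @ components + mean

# Toy features that truly live in a 4-dimensional subspace of a 32-dimensional space.
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 32))

mean, components = fit_pca(feats, n_components=4)
codes = compress(feats, mean, components)      # (500, 4)
recon = decompress(codes, mean, components)    # (500, 32)
```

Real CLIP features will not reconstruct exactly, so it is worth measuring the reconstruction error before committing to a dimension. Whatever compression you choose, a text embedding has to be mapped into the same compressed space (or the rendered features decompressed) before similarities are computed.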
Task B.3: Modify the Training Loop
Starting from gsplat’s simple_trainer.py, you need to add four things:
- Per-Gaussian learnable feature vector: a new parameter of shape [N, feature_dim]
- Feature rendering pass (in addition to the RGB pass)
- Feature distillation loss: compares the rendered features with the 2D feature maps
- Combined loss: the RGB loss plus the feature loss weighted by lambda_feat
Implementation notes:
- The feature rendering pass uses the same geometry (means, quats, scales, opacities) as RGB rendering. Gradients from the feature loss will flow back through geometry parameters, which can help or hurt RGB quality depending on lambda_feat
- Densification and pruning from the base trainer still apply: new Gaussians created by densification need their feature vectors initialized (e.g., copy from the parent Gaussian)
- Memory: with N=200,000 Gaussians and feature_dim=64, the feature parameter adds ~50 MB. With feature_dim=768 it adds ~600 MB
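
The memory figures above follow from simple arithmetic, assuming 32-bit floats (4 bytes per value). The helper name below is made up for this example.

```python
def feature_param_bytes(num_gaussians: int, feature_dim: int, bytes_per_value: int = 4) -> int:
    """Size of an [N, feature_dim] parameter tensor, float32 by default."""
    return num_gaussians * feature_dim * bytes_per_value

for dim in (64, 768):
    megabytes = feature_param_bytes(200_000, dim) / 1e6
    print(f"feature_dim={dim}: {megabytes:.1f} MB")
# prints:
# feature_dim=64: 51.2 MB
# feature_dim=768: 614.4 MB
```

These numbers cover the parameter tensor only. During training, the gradient and the two Adam moment buffers each have the same shape, so the footprint of the feature attribute is likely closer to four times these values.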
Task B.4: Query with Text
Build a query function that encodes text with CLIP and compares against rendered features.
Task 3: Build the Semantic Query Interface
This task is the same for both paths. Build a Python module (gs_semantic_query.py) that provides a clean interface for querying your trained language-embedded splat. The module should accept a text query and return structured results:
- Object localization: The query should not just produce a heatmap; it should return a 3D world-frame position. To do this:
  - Render feature maps from multiple viewpoints
  - Identify high-similarity regions in each view
  - Back-project the high-similarity pixels to 3D using the known camera parameters and depth (which you can render from the splat)
  - Aggregate the 3D points into a centroid
- Relevancy visualization: Generate a color-coded heatmap overlaid on the rendered RGB image, showing where the query matches.
- Confidence scoring: Report the maximum cosine similarity as a confidence score. Values above 0.25 typically indicate a meaningful match; below 0.15 is usually noise.
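
The back-projection and aggregation steps for a single view can be sketched as below. This is an illustration under stated assumptions: a pinhole camera with x right, y down, z forward; depth measured along the camera z-axis; and a 4x4 camera-to-world matrix. The function names are made up, and the 0.25 default threshold is taken from the confidence guideline above.

```python
import numpy as np

def backproject(pixels_uv, depths, K, cam_to_world):
    """Lift pixels to world-frame 3D points with a pinhole model.

    pixels_uv: (M, 2) pixel coordinates (u, v); depths: (M,) z-depth per pixel;
    K: (3, 3) intrinsics; cam_to_world: (4, 4) camera-to-world transform.
    """
    u, v = pixels_uv[:, 0], pixels_uv[:, 1]
    x = (u - K[0, 2]) / K[0, 0] * depths
    y = (v - K[1, 2]) / K[1, 1] * depths
    pts_cam = np.stack([x, y, depths, np.ones_like(depths)], axis=1)
    return (pts_cam @ cam_to_world.T)[:, :3]

def localize(similarity, depth, K, cam_to_world, threshold=0.25):
    """Centroid of back-projected pixels whose similarity exceeds the threshold."""
    v, u = np.nonzero(similarity > threshold)
    if len(u) == 0:
        return None  # nothing in this view matches the query
    pixels = np.stack([u, v], axis=1).astype(float)
    return backproject(pixels, depth[v, u], K, cam_to_world).mean(axis=0)

# Toy view: 64x48 image, one matching pixel at the principal point, 2 m away.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 24.0], [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[0, 3] = 1.0  # camera sits at x = 1 in the world frame
similarity = np.zeros((48, 64))
similarity[24, 32] = 0.9
depth = np.full((48, 64), 2.0)

position = localize(similarity, depth, K, cam_to_world)
```

For multiple viewpoints, one option is to collect the back-projected points from every view and take the centroid (or a more robust statistic such as the median) over the combined set.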
Queries to try include:
- “refrigerator”
- “couch”
- “dining table”
- “chair”
- “tv monitor”
- “kitchen counter”
- “the doorway to the bedroom”
- “something to sit on”
- “area with appliances”
- “wooden bookshelf”
Note about Gazebo: CLIP was trained on real-world images. On simulated scenes, semantic accuracy depends on how recognizable the simulated objects are to CLIP. Objects with distinctive shapes (refrigerator, couch) typically work better than those that rely on photorealistic textures. Document any cases where the simulation gap affects results.
Task 4: LLM-Powered Navigation Agent
This task is the same for both paths. Build an agent that accepts natural language navigation commands and translates them into robot trajectories using the semantic splat as a world model.
Architecture
Supported query types
The agent should support at minimum these types of queries:
| Query Type | Example | Expected Behavior |
|---|---|---|
| Navigate to object | “go to the refrigerator” | Resolve object location via semantic splat, plan path via Nav2, execute |
| Show trajectories | “show me all the top trajectories that reach the refrigerator location” | Resolve location, compute multiple candidate paths, render expected views along each path from the splat, return ranked results |
| Scene query | “what objects are near the dining table?” | Query multiple object types, compute spatial relationships, return natural language description |
| Exploration | “explore the area around the kitchen” | Resolve “kitchen” region, generate waypoints for coverage, optionally extend the splat with new captures |
Implementation guidance
1. LLM tool interface: Define the agent’s tools as functions the LLM can call.
2. Trajectory ranking: Rank candidate trajectories using criteria such as:
- Path length: shorter is better
- Clearance from obstacles: safer paths score higher
- Visual coverage: paths through well-reconstructed areas of the splat (where the robot can verify its position via render-and-compare localization)
- Semantic relevance: paths that pass by related objects (e.g., a path to the refrigerator that goes through the kitchen)
3. ROS integration: Connect the agent to the robot through ROS 2:
- Read current pose from /odom
- Send goals via Nav2’s NavigateToPose action
- Monitor progress via Nav2 feedback
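
One possible shape for the LLM tool interface is a small registry that pairs each Python function with a JSON-schema description, plus a dispatcher that executes the tool call the LLM returns. Everything here is hypothetical: the tool names, the stub bodies, and the placeholder position are illustrations only, and the exact tool-call format depends on the LLM API you use.

```python
import json

def locate_object(query: str) -> dict:
    """Hypothetical stub: a real version would call the semantic splat query module."""
    known = {"refrigerator": [2.0, 1.5, 0.9]}  # placeholder position, not real data
    position = known.get(query)
    return {"query": query, "found": position is not None, "position": position}

def navigate_to(x: float, y: float) -> dict:
    """Hypothetical stub: a real version would send a NavigateToPose goal via Nav2."""
    return {"status": "goal_sent", "goal": [x, y]}

TOOLS = {
    "locate_object": {
        "fn": locate_object,
        "description": "Find the 3D world-frame position of an object described in text.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    "navigate_to": {
        "fn": navigate_to,
        "description": "Drive the robot to a 2D goal in the map frame.",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "number"}, "y": {"type": "number"}},
            "required": ["x", "y"],
        },
    },
}

def tool_specs() -> list:
    """Specs to hand to the LLM (everything except the Python callable)."""
    return [
        {"name": name, "description": tool["description"], "parameters": tool["parameters"]}
        for name, tool in TOOLS.items()
    ]

def dispatch(tool_call_json: str) -> dict:
    """Run one tool call of the form {"name": ..., "arguments": {...}}."""
    call = json.loads(tool_call_json)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return {"error": f"unknown tool {call['name']}"}
    return tool["fn"](**call["arguments"])

result = dispatch('{"name": "locate_object", "arguments": {"query": "refrigerator"}}')
```

Keeping the tools as plain functions behind a dispatcher lets you test each one without an LLM in the loop, and returning structured dictionaries gives the LLM something it can reason over in the next turn.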
Research pointers
These papers are directly relevant to implementing the navigation agent:
- Splat-Loc (arxiv.org/abs/2312.02126), render-and-compare re-localization at ~25 Hz against a pre-built Gaussian Splat map. Relevant for verifying the robot’s position against the splat during trajectory execution.
- GS-Loc (RAL 2025), vision foundation model-driven re-localization using Gaussian Splats.
- HAMMER (arxiv.org/abs/2501.14147), multi-robot collaborative semantic Gaussian Splatting with ROS communication. Shows how to architect the ROS-to-splat interface.
- ROSplat (github.com/shadygm/ROSplat), ROS 2 Jazzy package for online Gaussian Splat visualization with custom GaussianArray messages.
Task 5: Evaluation
This task is the same for both paths. Evaluate your semantic navigation agent across four dimensions:
5.1 Object localization accuracy
For 5 objects in the Gazebo house world, compare the semantic splat’s estimated 3D position with the ground-truth position from the Gazebo world file (.world or .sdf).
| Object | Ground Truth (x, y, z) | Estimated (x, y, z) | Error (m) |
|---|---|---|---|
| refrigerator | | | |
| couch | | | |
| dining table | | | |
| … | | | |
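
The Error (m) column is the Euclidean distance between the two positions. A minimal sketch follows; the coordinates are made-up numbers for illustration, not expected results.

```python
import math

def position_error(ground_truth, estimate) -> float:
    """Euclidean distance in meters between two (x, y, z) positions."""
    return math.dist(ground_truth, estimate)

# Made-up coordinates, for illustration only.
ground_truth = (1.0, 2.0, 0.0)
estimate = (1.3, 2.4, 0.0)
error = position_error(ground_truth, estimate)  # approximately 0.5 m
```

The comparison is only meaningful if both positions are expressed in the same frame and scale. COLMAP reconstructions come out in an arbitrary frame and scale, so the splat coordinates typically need to be aligned to the Gazebo world frame first; state in your report how you did this.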
5.2 Open-vocabulary vs. closed-vocabulary
For 10 queries (5 within COCO vocabulary, 5 outside it), compare:
| Query | In COCO? | COCO Detector Finds It? | Semantic Splat Finds It? | Splat Confidence | Position Error (m) |
|---|---|---|---|---|---|
| “refrigerator” | Yes | | | | |
| “kitchen counter” | No | N/A | | | |
| … | | | | | |
5.3 Query success rate
Issue 10 natural language navigation queries of varying complexity. For each query, record:
| Query | Intent Parsed? | Valid Location? | Trajectory Reached Goal? | Splat Views Useful? |
|---|---|---|---|---|
| “go to the refrigerator” | | | | |
| “explore the kitchen area” | | | | |
| … | | | | |
5.4 Trajectory quality
For the “show me top trajectories to the refrigerator” query, compare the top-3 trajectories:
| Trajectory | Path Length (m) | Min Clearance (m) | Visual Quality (1-5) | Notes |
|---|---|---|---|---|
| #1 | | | | |
| #2 | | | | |
| #3 | | | | |
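
The two numeric columns can be computed directly from a trajectory’s waypoints. The sketch below is one simple way to do it, under stated assumptions: waypoints and obstacles are 2D points in the map frame, clearance is checked only at the waypoints (so the path should be densely sampled), and the scoring weights are arbitrary placeholders. Visual coverage and semantic relevance are left out.

```python
import math

def path_length(waypoints) -> float:
    """Sum of straight-line segment lengths along the path, in meters."""
    return sum(math.dist(a, b) for a, b in zip(waypoints, waypoints[1:]))

def min_clearance(waypoints, obstacles) -> float:
    """Smallest distance from any waypoint to any obstacle point."""
    return min(math.dist(w, o) for w in waypoints for o in obstacles)

def score(waypoints, obstacles, w_length=1.0, w_clearance=1.0) -> float:
    """Higher is better: reward clearance, penalize length. Weights are arbitrary."""
    return w_clearance * min_clearance(waypoints, obstacles) - w_length * path_length(waypoints)

# Two toy candidates between the same start and goal, and one obstacle point.
candidates = {
    "direct": [(0.0, 0.0), (3.0, 4.0)],
    "detour": [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0)],
}
obstacles = [(3.0, 1.0)]

ranked = sorted(candidates, key=lambda name: score(candidates[name], obstacles), reverse=True)
```

With real Nav2 paths, the obstacle points could come from occupied cells of the costmap, and the weights would need tuning so that neither term dominates.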
References
Foundational
- B. Kerbl, G. Kopanas, T. Leimkuehler, G. Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering. ACM TOG (SIGGRAPH) 2023.
- S. Chen, Y. Li, S. Chen, et al. A Survey on 3D Gaussian Splatting for Robotics. 2024.
Language-embedded Gaussian Splats
- S. Zhou, H. Chang, S. Jiang, Z. Fan, Z. Zhu, D. Xu, P. Chari, S. You, Z. Wang, A. Kadambi. Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields. CVPR 2024. GitHub
- M. Qin, W. Li, J. Zhou, H. Wang, H. Pfister. LangSplat: 3D Language Gaussian Splatting. CVPR 2024. GitHub
- W. Li, et al. LangSplatV2: Global 3D Codebook for Language Gaussian Splatting. NeurIPS 2025.
- J. Shi, J. Wang, L. Jiang, C. H. Tan, et al. Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding. CVPR 2024 / TPAMI 2025. GitHub
- Y. Wu, J. Zhang, et al. OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding. NeurIPS 2024. GitHub
Re-localization
- Splat-Loc: 3D Gaussian Splatting Place Recognition and Localization. arxiv.org/abs/2312.02126
- GS-Loc: Vision Foundation Model-Driven Gaussian Splatting Localization. RAL 2025.
Frameworks and tools
- gsplat, CUDA-accelerated Gaussian Splatting rasterization library
- COLMAP, Structure-from-Motion and Multi-View Stereo
- open_clip, Open-source CLIP implementation
- HAMMER, Multi-robot collaborative semantic Gaussian Splatting with ROS
- ROSplat, ROS 2 Jazzy package for online Gaussian Splat visualization
Summary of Deliverables
| Task | Deliverables |
|---|---|
| Path A (Tasks A.1-A.4) or Path B (Tasks B.1-B.4) | Trained language-embedded Gaussian Splat of the Gazebo house world, feature extraction pipeline, working text query with heatmap output |
| Task 3: Semantic Query Interface | gs_semantic_query.py module with SemanticSplatQuery class, query results (position + heatmap + confidence) for 10+ objects |
| Task 4: LLM Navigation Agent | Agent source code with tool definitions, ROS integration, demo video showing at least 3 different semantic navigation queries being executed |
| Task 5: Evaluation | Object localization accuracy table (5 objects), open-vocab vs. closed-vocab comparison (10 queries), query success rate (10 navigation queries), trajectory comparison (top-3), written analysis of strengths and limitations |
/data/results/.

