This part is optional. It builds on the work from Parts 1-2 and is intended for students who want to explore open-vocabulary scene understanding and LLM-powered navigation on top of their trained Gaussian Splats.

Introduction

In Assignment 1 you trained a standard 3D Gaussian Splat using gsplat’s simple_trainer on both a real-world scene (your room) and a simulated scene (the Gazebo house world). That splat captures geometry and appearance — it can render novel views and represent free space, but it has no understanding of what is in the scene. A refrigerator and a bookshelf are just collections of colored Gaussians. In this assignment you will extend your trained splat with language understanding, so that every Gaussian carries not only position, color, and opacity but also a semantic feature vector. The 3D map becomes queryable with arbitrary natural language — a capability that fundamentally changes what a robot can do with its world model.

From closed vocabulary to open-world semantics

Traditional object detection in robotics relies on models trained on fixed class sets. A COCO-trained YOLO detector recognizes 80 object classes. If “refrigerator” is one of those 80 classes, the detector finds it. If you ask for “the red mug on the counter,” “the fire extinguisher,” or “the area that looks like a kitchen” — concepts outside the training vocabulary — the detector is blind.

CLIP (Contrastive Language-Image Pre-training) changes this equation. Trained on 400 million image-text pairs from the internet, CLIP encodes a broad visual-semantic understanding into a shared embedding space where images and text are directly comparable. A CLIP feature vector for a patch of an image can be compared with the CLIP encoding of any text string using cosine similarity — no fixed class list required.

Language-embedded Gaussian Splatting takes this one step further: each Gaussian in the 3D scene gets an additional learnable feature vector (a CLIP embedding) alongside its geometric and appearance attributes. During training, these features are supervised by dense CLIP feature maps extracted from the training images. At query time, you render the per-Gaussian features from any viewpoint and compare them with a text query’s CLIP embedding. The result is a heatmap showing where in the scene the queried concept appears. This enables queries no fixed-class detector can handle:
  • “where is the refrigerator?” (COCO-equivalent)
  • “the area that looks like a kitchen” (spatial/semantic region)
  • “something to sit on” (functional description)
  • “the doorway to the bedroom” (relational/contextual)
Combined with an LLM agent, a language-embedded splat enables navigation commands in natural language rather than explicit coordinates.
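The query mechanics reduce to normalized dot products. A toy sketch with random stand-ins for CLIP embeddings (a real pipeline would encode the text query with CLIP and render per-pixel features from the splat; the 8-dim vectors here are illustrative only):

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for CLIP embeddings (D would be 512 for ViT-B/32).
D = 8
torch.manual_seed(0)
text_emb = F.normalize(torch.randn(1, D), dim=-1)        # e.g. the query "refrigerator"
pixel_feats = F.normalize(torch.randn(4, 4, D), dim=-1)  # rendered (H, W, D) feature map

# With unit vectors, cosine similarity is just a dot product per pixel.
heatmap = (pixel_feats * text_emb).sum(dim=-1)  # (H, W) relevancy heatmap
print(heatmap.shape)  # torch.Size([4, 4])
```

Because there is no class list, any string can play the role of `text_emb`, which is exactly the open-vocabulary property exploited in the rest of this assignment.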

Key insight: gsplat already supports feature rendering

gsplat’s rasterizer is not limited to rendering RGB. The rasterization function accepts per-Gaussian colors of shape [N, D] where D can be any channel count — 3 for RGB, 512 for CLIP embeddings, or any compressed dimension in between. This is the foundation both implementation paths in this assignment build on. The rasterizer alpha-composites whatever per-Gaussian vectors you provide, just as it does for color.

Recent methods

| Method | Key Idea | Speed | Reference |
| --- | --- | --- | --- |
| Feature-3DGS | N-dimensional rasterizer for arbitrary foundation model features | Real-time | CVPR 2024 |
| LangSplat | Scene-wise language autoencoder, CLIP features in latent space | 199x faster than LERF | CVPR 2024 |
| LangSplatV2 | Global 3D codebook for high-dim features | 450+ FPS | NeurIPS 2025 |
| LEGaussians | Quantized CLIP + DINO per Gaussian | Real-time | CVPR 2024 / TPAMI 2025 |
| OpenGaussian | Point-level open-vocabulary understanding | Real-time | NeurIPS 2024 |

Prerequisites

  • Completed Assignment 1 (trained gsplat model of your room and the Gazebo house world, with COLMAP data)
  • Familiarity with CLIP (OpenAI CLIP paper, HuggingFace transformers CLIP)
  • GPU with 16+ GB VRAM (language features increase memory requirements)
  • Python 3.10+, PyTorch 2.1+, CUDA 11.8+

Choose Your Implementation Path

This assignment offers two implementation paths. Both use the same COLMAP-format data from Assignment 1 and produce the same deliverables. Choose the path that matches your goals.

Path A: Feature-3DGS

Use the Feature-3DGS codebase, which provides a complete pipeline for embedding foundation model features into Gaussian Splats.
  • What you get: Feature extraction scripts, an N-dimensional Gaussian rasterizer, training pipeline, and query utilities — all working out of the box with COLMAP data
  • Architecture: Extends the standard 3DGS rasterizer to render N-dimensional feature vectors alongside RGB. Distills 2D foundation model features (SAM, CLIP-LSeg) into per-Gaussian feature attributes
  • Dependencies: PyTorch 2.4+, CUDA 11.8+
  • Tradeoff: You use an existing, well-tested codebase — faster path to results, but less low-level understanding of the rendering internals

Path B: Build from Scratch on gsplat (Advanced)

Build the entire language embedding pipeline yourself on top of gsplat.
  • What you build: CLIP feature extraction from training images, per-Gaussian feature training with a modified simple_trainer, text query interface
  • Foundation: gsplat’s rasterization already supports arbitrary-dimension per-Gaussian features — you implement everything above that primitive
  • Tradeoff: Significantly more work, but you gain deep understanding of how language-embedded splats work and full flexibility to experiment
Both paths converge at Task 3 (Semantic Query Interface), Task 4 (LLM Navigation Agent), and Task 5 (Evaluation), which are identical regardless of path chosen.

Path A: Feature-3DGS

Task A.1: Setup

Clone and install the Feature-3DGS codebase:
git clone https://github.com/ShijieZhou-UCLA/feature-3dgs.git
cd feature-3dgs
pip install -r requirements.txt

# Install the N-dimensional Gaussian rasterizer
pip install submodules/diff-gaussian-rasterization-feature/
pip install submodules/simple-knn/
Architecture overview: Feature-3DGS extends the standard 3DGS rasterizer to render N-dimensional feature vectors alongside RGB. The key insight is that the alpha-compositing operation used to render color can render any per-Gaussian vector. During training, the system:
  1. Extracts dense 2D feature maps from each training image using foundation models (SAM, CLIP-LSeg)
  2. Assigns each Gaussian a learnable feature vector of dimension D
  3. Renders these feature vectors using the N-dimensional rasterizer
  4. Supervises with a feature distillation loss: MSE between rendered features and the ground-truth 2D feature maps
  5. Total loss = photometric RGB loss + feature distillation loss
The result is a Gaussian Splat where each Gaussian encodes both appearance (RGB) and semantics (CLIP features).

Task A.2: Extract Foundation Model Features

Before training, you need per-image dense feature maps as supervision targets. Feature-3DGS supports multiple foundation models. For this assignment, extract CLIP-based features:
  1. SAM features — Segment Anything masks for each training image, providing instance-level boundaries
  2. CLIP-LSeg features — per-pixel language-grounded features from LSeg (Language-driven Semantic Segmentation)
# Extract SAM features
python extract_sam_features.py \
  --data_dir /data/captures/house_run \
  --output_dir /data/features/house_run/sam

# Extract CLIP-LSeg features
python extract_clip_features.py \
  --data_dir /data/captures/house_run \
  --output_dir /data/features/house_run/clip
Note: Exact script names and arguments depend on the Feature-3DGS version — consult their README for the current API. The data directory should point to your COLMAP output from Assignment 1 (containing sparse/0/cameras.bin, images.bin, points3D.bin, and the images/ folder).
The extraction step processes each training image through the foundation models and saves dense feature maps (typically as .npy or .pt files) that will be loaded during training.

Task A.3: Train Feature-Embedded Gaussians

Train the language-embedded Gaussian Splat using your COLMAP data and the extracted features:
python train.py \
  --source_path /data/captures/house_run \
  --feature_dir /data/features/house_run/clip \
  --model_path /data/results/house_run_semantic \
  --iterations 30000
Monitor training for:
  • RGB PSNR — should be comparable to your Assignment 1 splat (the feature head should not degrade visual quality)
  • Feature loss — should decrease steadily, indicating the per-Gaussian features are learning to reproduce the 2D CLIP maps
Training typically takes 30-60 minutes on a single GPU, slightly longer than a standard RGB-only splat due to the additional feature rendering pass.
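A minimal PSNR helper for the monitoring step above (assumes images scaled to [0, 1]; the "around 40 dB" figure is just what 1% synthetic noise produces, not a training target):

```python
import torch

def psnr(rendered: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images in [0, max_val]."""
    mse = torch.mean((rendered - target) ** 2)
    return float(10.0 * torch.log10(max_val ** 2 / mse))

img = torch.rand(3, 64, 64)
noisy = (img + 0.01 * torch.randn_like(img)).clamp(0.0, 1.0)
print(psnr(img, noisy))  # around 40 dB for 1% noise
```

Compare the feature-embedded run's PSNR curve against your Assignment 1 run; a large gap suggests lambda_feat is set too high.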

Task A.4: Query with Text

Once trained, you can render semantic feature maps from arbitrary viewpoints and compare them with text queries:
import torch
from open_clip import create_model_and_transforms, get_tokenizer

# Encode the text query into CLIP embedding space
model, _, _ = create_model_and_transforms('ViT-B-32', pretrained='openai')
tokenizer = get_tokenizer('ViT-B-32')
text = tokenizer(["refrigerator"])
text_features = model.encode_text(text)  # shape: (1, D)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)

# Render the feature map from a chosen viewpoint
# rendered_features shape: (H, W, D) where D is the CLIP embedding dimension
rendered_features = render_features(viewpoint_camera)  # from Feature-3DGS
rendered_features = rendered_features / rendered_features.norm(dim=-1, keepdim=True)

# Compute cosine similarity between the text query and each pixel's features
similarity = torch.nn.functional.cosine_similarity(
    rendered_features,           # (H, W, D)
    text_features.unsqueeze(0),  # (1, D) -> (1, 1, D), broadcasts over H and W
    dim=-1
)
# similarity is a (H, W) heatmap — high values indicate the queried object
Verify your pipeline by querying for objects you can visually confirm in the Gazebo house world (e.g., “refrigerator”, “couch”, “table”).

Path B: Build from Scratch on gsplat

This is the advanced path. You will build the entire language embedding pipeline on top of gsplat’s rasterization primitive.

Task B.1: Understand gsplat’s Feature Rendering

gsplat’s rasterization function accepts per-Gaussian colors of arbitrary dimension D. Normally D=3 (RGB), but you can pass D=512 (full CLIP embedding), D=64 (compressed), or any other dimension. The rasterizer alpha-composites whatever vectors you provide, producing a (H, W, D) output image. This is the key API:
from gsplat import rasterization

# Standard RGB rendering (D=3)
renders, alphas, info = rasterization(
    means=gaussians.means,         # (N, 3) — positions
    quats=gaussians.quats,         # (N, 4) — rotations
    scales=gaussians.scales,       # (N, 3) — scales
    opacities=gaussians.opacities, # (N,)   — opacities
    colors=gaussians.colors,       # (N, 3) — RGB
    viewmats=viewmats,             # (C, 4, 4)
    Ks=Ks,                         # (C, 3, 3)
    width=width,
    height=height,
)
# renders shape: (C, H, W, 3)

# Feature rendering — same API, just change the colors dimension
feature_renders, alphas, info = rasterization(
    means=gaussians.means,         # (N, 3)
    quats=gaussians.quats,         # (N, 4)
    scales=gaussians.scales,       # (N, 3)
    opacities=gaussians.opacities, # (N,)
    colors=gaussians.features,     # (N, D) — D-dimensional features
    viewmats=viewmats,
    Ks=Ks,
    width=width,
    height=height,
)
# feature_renders shape: (C, H, W, D)
The geometry (means, quats, scales, opacities) and camera parameters are identical — only the colors argument changes. This means you can share the Gaussian geometry from your Assignment 1 model and simply add a new learnable feature attribute.

Task B.2: Extract CLIP Features from Training Images

For each training image, you need dense per-pixel CLIP features as supervision targets.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def extract_dense_clip_features(image_path: str) -> torch.Tensor:
    """Extract dense per-patch CLIP features from an image.
    
    CLIP ViT-B/16 divides a 224x224 input into 14x14 = 196 patches.
    Each patch produces a 768-dim feature vector.
    We extract these patch tokens (excluding CLS) and upsample 
    to the original image resolution.
    """
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model.vision_model(**inputs, output_hidden_states=True)
    
    # Extract patch tokens from the last hidden state
    # Shape: (1, 197, 768) — 1 CLS token + 196 patch tokens
    # Caveat: these 768-dim tokens live in the vision encoder's space, not the
    # joint CLIP space. To compare them against text embeddings you must map
    # them into the shared space (e.g., via model.visual_projection) or learn
    # a compression that aligns the two.
    patch_tokens = outputs.last_hidden_state[:, 1:, :]  # (1, 196, 768)
    
    # Reshape to spatial grid: (1, 14, 14, 768) -> (1, 768, 14, 14)
    B, N, D = patch_tokens.shape
    h = w = int(N ** 0.5)
    features = patch_tokens.reshape(B, h, w, D).permute(0, 3, 1, 2)
    
    # Upsample to original image resolution
    orig_h, orig_w = image.size[1], image.size[0]
    features = torch.nn.functional.interpolate(
        features, size=(orig_h, orig_w), mode='bilinear', align_corners=False
    )
    # Output: (1, 768, H, W) -> (H, W, 768)
    return features.squeeze(0).permute(1, 2, 0)
Pipeline:
  1. For each training image in your COLMAP dataset, extract dense CLIP features using the function above
  2. Save the feature maps to disk (they are large — 768 floats per pixel)
  3. Optional but recommended: Compress features with PCA or a small autoencoder to reduce dimensionality from 768 to 64 or 128. This dramatically reduces memory (768 floats per Gaussian vs. 64) and speeds up rasterization. LangSplat uses a scene-wise autoencoder for this purpose
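One way to implement the recommended PCA compression, sketched with torch.pca_lowrank. Fit the projection on a subsample of training-image features, then reuse the same mean and basis everywhere (all names here are illustrative):

```python
import torch

def fit_pca(feats: torch.Tensor, k: int = 64):
    """feats: (M, D) subsample of dense CLIP features. Returns (mean, basis)."""
    mean = feats.mean(dim=0)
    # V: (D, k) top-k principal directions of the centered features
    _, _, V = torch.pca_lowrank(feats - mean, q=k, center=False)
    return mean, V

def compress(feats: torch.Tensor, mean: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Project (..., D) features into the (..., k) compressed space."""
    return (feats - mean) @ V

sample = torch.randn(1000, 768)          # stand-in for real CLIP features
mean, V = fit_pca(sample, k=64)
compressed = compress(sample, mean, V)   # (1000, 64)
# At query time, apply compress() to the text embedding with the SAME mean/V.
```

Save `mean` and `V` alongside the feature maps; losing them makes the compressed features unqueryable.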

Task B.3: Modify the Training Loop

Starting from gsplat’s simple_trainer.py, you need to add four things:
  1. Per-Gaussian learnable feature vector:
# In the model initialization, alongside existing parameters:
feature_dim = 64  # or 768 if not compressing
self.features = torch.nn.Parameter(
    torch.randn(N, feature_dim) * 0.01  # small random init
)
# Add to optimizer alongside means, quats, scales, etc.
  2. Feature rendering pass (in addition to the RGB pass):
# In the training loop, after RGB rendering:
feature_renders, _, _ = rasterization(
    means=self.means,
    quats=self.quats,
    scales=self.scales,
    opacities=self.opacities,
    colors=self.features,       # (N, feature_dim)
    viewmats=viewmats,
    Ks=Ks,
    width=width,
    height=height,
)
# feature_renders shape: (C, H, W, feature_dim)
  3. Feature distillation loss:
# gt_features: ground-truth CLIP features for this view, shape (H, W, feature_dim)
feature_loss = torch.nn.functional.mse_loss(
    feature_renders.squeeze(0),  # (H, W, feature_dim)
    gt_features                   # (H, W, feature_dim)
)
  4. Combined loss:
lambda_feat = 0.1  # tune this hyperparameter
total_loss = rgb_loss + lambda_feat * feature_loss
total_loss.backward()
Important considerations:
  • The feature rendering pass uses the same geometry (means, quats, scales, opacities) as RGB rendering. Gradients from the feature loss will flow back through geometry parameters, which can help or hurt RGB quality depending on lambda_feat
  • Densification and pruning from the base trainer still apply — new Gaussians created by densification need their feature vectors initialized (e.g., copy from the parent Gaussian)
  • Memory: with N=200,000 Gaussians and feature_dim=64, the feature parameter adds ~50 MB. With feature_dim=768 it adds ~600 MB
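A sketch of keeping the feature table consistent under densification and pruning. Function names and hooks are hypothetical and depend on your simple_trainer variant, and the optimizer state for this parameter must be resized the same way, exactly as the base trainer already does for means/scales/etc.:

```python
import torch

# Hypothetical helpers; hook them into the trainer's densify/prune steps.
def densify_features(features: torch.nn.Parameter, parent_idx: torch.Tensor):
    """Append copies of parent features for newly cloned/split Gaussians."""
    new_rows = features.data[parent_idx].clone()
    return torch.nn.Parameter(torch.cat([features.data, new_rows], dim=0))

def prune_features(features: torch.nn.Parameter, keep_mask: torch.Tensor):
    """Drop feature rows for pruned Gaussians."""
    return torch.nn.Parameter(features.data[keep_mask])

features = torch.nn.Parameter(torch.randn(10, 64) * 0.01)
features = densify_features(features, torch.tensor([0, 3]))  # now (12, 64)
features = prune_features(features, torch.arange(12) < 11)   # now (11, 64)
```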

Task B.4: Query with Text

Build a query function that encodes text with CLIP and compares against rendered features:
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def query_scene(text_query: str, viewpoint, trained_splat) -> torch.Tensor:
    """Query the trained splat with a text string.
    
    Returns a (H, W) relevancy heatmap.
    """
    # Encode text
    inputs = processor(text=[text_query], return_tensors="pt", padding=True)
    with torch.no_grad():
        text_features = model.get_text_features(**inputs)  # (1, 512)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    
    # If you used PCA compression, project text_features through the same PCA
    # text_features_compressed = pca.transform(text_features)
    
    # Render feature map from viewpoint
    feature_map = render_features(viewpoint, trained_splat)  # (H, W, D)
    feature_map = feature_map / feature_map.norm(dim=-1, keepdim=True)
    
    # Cosine similarity
    similarity = torch.nn.functional.cosine_similarity(
        feature_map,
        text_features.unsqueeze(0),  # (1, 1, D) broadcast
        dim=-1
    )
    return similarity  # (H, W) heatmap
If you used PCA or an autoencoder to compress features during training, you must apply the same compression to the text query embedding before computing similarity. This is a common source of bugs — make sure the text and rendered features live in the same space.

Task 3: Build the Semantic Query Interface

This task is the same for both paths. Build a Python module (gs_semantic_query.py) that provides a clean interface for querying your trained language-embedded splat. The module should accept a text query and return structured results:
from gs_semantic_query import SemanticSplatQuery

query_engine = SemanticSplatQuery(
    model_path="/data/results/house_run_semantic"
)

result = query_engine.query("refrigerator")
# result.position        → np.array([x, y, z]) in world frame
# result.relevancy_image → np.ndarray (H, W, 3) heatmap visualization
# result.confidence      → float (0-1) cosine similarity score
Implementation requirements:
  1. Object localization: The query should not just produce a heatmap — it should return a 3D world-frame position. To do this:
    • Render feature maps from multiple viewpoints
    • Identify high-similarity regions in each view
    • Back-project the high-similarity pixels to 3D using the known camera parameters and depth (which you can render from the splat)
    • Aggregate the 3D points into a centroid
  2. Relevancy visualization: Generate a color-coded heatmap overlaid on the rendered RGB image, showing where the query matches.
  3. Confidence scoring: Report the maximum cosine similarity as a confidence score. Values above 0.25 typically indicate a meaningful match; below 0.15 is usually noise.
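The back-projection step in requirement 1 can be sketched with a standard pinhole model (variable names are illustrative; it assumes you can render a depth map from the splat and know each view's intrinsics K and camera-to-world pose):

```python
import torch

def localize_from_view(similarity, depth, K, cam_to_world, thresh=0.25):
    """Back-project high-similarity pixels to a world-frame centroid.

    similarity:   (H, W) cosine-similarity heatmap for one rendered view
    depth:        (H, W) depth map rendered from the splat
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera pose
    """
    ys, xs = torch.nonzero(similarity > thresh, as_tuple=True)
    if len(xs) == 0:
        return None  # no confident match in this view
    z = depth[ys, xs]
    # Pinhole model: pixel -> camera-frame point at the rendered depth
    x_cam = (xs.float() - K[0, 2]) / K[0, 0] * z
    y_cam = (ys.float() - K[1, 2]) / K[1, 1] * z
    pts = torch.stack([x_cam, y_cam, z, torch.ones_like(z)], dim=-1)  # (M, 4)
    world = (cam_to_world @ pts.T).T[:, :3]
    return world.mean(dim=0)  # aggregate centroids across views in practice
```

Running this over several viewpoints and averaging (or clustering) the per-view centroids gives the 3D position the interface should return.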
Test your query interface with these queries: COCO-equivalent queries (baseline — a fixed-class detector could handle these):
  • “refrigerator”
  • “couch”
  • “dining table”
  • “chair”
  • “tv monitor”
Open-vocabulary queries (the new capability):
  • “kitchen counter”
  • “the doorway to the bedroom”
  • “something to sit on”
  • “area with appliances”
  • “wooden bookshelf”
Note about Gazebo: CLIP was trained on real-world images. On simulated scenes, semantic accuracy depends on how recognizable the simulated objects are to CLIP. Objects with distinctive shapes (refrigerator, couch) typically work better than those that rely on photorealistic textures. Document any cases where the simulation gap affects results.

Task 4: LLM-Powered Navigation Agent

This task is the same for both paths. Build an agent that accepts natural language navigation commands and translates them into robot trajectories using the semantic splat as a world model.

Architecture

User prompt
  "show me all the top trajectories that reach the refrigerator location"
    |
    v
LLM Agent (Claude / GPT-4)
    |
    | 1. Parse intent -> target object = "refrigerator"
    | 2. Query semantic splat -> get 3D location
    | 3. Get current robot pose from /odom
    | 4. Query Nav2 planner for candidate paths
    | 5. Rank trajectories by criteria
    | 6. Render expected views along top trajectories from splat
    v
Response with trajectory visualizations

Supported query types

The agent should support at minimum these types of queries:
| Query Type | Example | Expected Behavior |
| --- | --- | --- |
| Navigate to object | “go to the refrigerator” | Resolve object location via semantic splat, plan path via Nav2, execute |
| Show trajectories | “show me all the top trajectories that reach the refrigerator location” | Resolve location, compute multiple candidate paths, render expected views along each path from the splat, return ranked results |
| Scene query | “what objects are near the dining table?” | Query multiple object types, compute spatial relationships, return natural language description |
| Exploration | “explore the area around the kitchen” | Resolve “kitchen” region, generate waypoints for coverage, optionally extend the splat with new captures |

Implementation guidance

1. LLM tool interface: Define the agent’s tools as functions the LLM can call:
def query_object_location(object_name: str) -> dict:
    """Query the semantic splat for an object's 3D position.
    Returns: {position: [x, y, z], confidence: float}
    """

def get_robot_pose() -> dict:
    """Get the robot's current pose from /odom.
    Returns: {x: float, y: float, z: float, yaw: float}
    """

def plan_path(start: list, goal: list) -> list:
    """Plan a path from start to goal using Nav2.
    Returns: list of [x, y, z, yaw] waypoints
    """

def render_view_from_splat(pose: list) -> str:
    """Render what the robot would see from a given pose.
    Returns: path to the rendered image
    """

def send_nav2_goal(x: float, y: float, yaw: float) -> str:
    """Send a navigation goal to the robot via Nav2.
    Returns: status string
    """
2. Trajectory ranking: For the “show me top trajectories” query, rank by:
  • Path length — shorter is better
  • Clearance from obstacles — safer paths score higher
  • Visual coverage — paths through well-reconstructed areas of the splat (where the robot can verify its position via render-and-compare localization)
  • Semantic relevance — paths that pass by related objects (e.g., a path to the refrigerator that goes through the kitchen)
3. Splat-based visualization: For each candidate trajectory, render what the robot would see at key waypoints along the path using the trained splat. This gives the user a visual preview of the journey before committing to execution. This is one of the unique advantages of having a renderable 3D world model — something a point cloud or occupancy grid cannot provide.
4. ROS integration: The agent communicates with the robot through the existing ROS 2 / Zenoh infrastructure:
  • Read current pose from /odom
  • Send goals via Nav2’s NavigateToPose action
  • Monitor progress via Nav2 feedback
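The trajectory-ranking criteria in step 2 can be expressed as a weighted score. A minimal sketch, where the weights, normalizers, and candidate numbers are all illustrative and should be tuned:

```python
def score_trajectory(length_m, min_clearance_m, coverage, relevance,
                     w=(0.4, 0.3, 0.2, 0.1)):
    """Higher is better; coverage and relevance assumed pre-normalized to [0, 1]."""
    length_term = 1.0 / (1.0 + length_m)        # shorter paths score higher
    clearance_term = min(min_clearance_m, 1.0)  # saturate beyond 1 m clearance
    return (w[0] * length_term + w[1] * clearance_term
            + w[2] * coverage + w[3] * relevance)

candidates = {
    "direct":      (4.0, 0.3, 0.8, 0.5),  # short but tight clearance
    "via_kitchen": (6.5, 0.8, 0.9, 0.9),  # longer, safer, semantically relevant
}
ranked = sorted(candidates, key=lambda k: score_trajectory(*candidates[k]),
                reverse=True)
print(ranked)  # ['via_kitchen', 'direct'] with these weights
```

Exposing the weights to the LLM agent lets the user's phrasing (e.g., “safest path” vs. “fastest path”) shift the ranking.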

Research pointers

These papers are directly relevant to implementing the navigation agent:
  • Splat-Loc (arxiv.org/abs/2312.02126) — render-and-compare re-localization at ~25 Hz against a pre-built Gaussian Splat map. Relevant for verifying the robot’s position against the splat during trajectory execution.
  • GS-Loc (RAL 2025) — vision foundation model-driven re-localization using Gaussian Splats.
  • HAMMER (arxiv.org/abs/2501.14147) — multi-robot collaborative semantic Gaussian Splatting with ROS communication. Shows how to architect the ROS-to-splat interface.
  • ROSplat (github.com/shadygm/ROSplat) — ROS 2 Jazzy package for online Gaussian Splat visualization with custom GaussianArray messages.

Task 5: Evaluation

This task is the same for both paths. Evaluate your semantic navigation agent across four dimensions:

5.1 Object localization accuracy

For 5 objects in the Gazebo house world, compare the semantic splat’s estimated 3D position with the ground-truth position from the Gazebo world file (.world or .sdf).
| Object | Ground Truth (x, y, z) | Estimated (x, y, z) | Error (m) |
| --- | --- | --- | --- |
| refrigerator | | | |
| couch | | | |
| dining table | | | |
Report the mean position error in meters. Discuss sources of error (splat reconstruction quality, CLIP feature accuracy on simulated images, viewpoint selection for localization).
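Mean position error is the average Euclidean distance between ground-truth and estimated positions. A numpy sketch with made-up coordinates (use your Gazebo .sdf values and your query interface's outputs):

```python
import numpy as np

# Illustrative numbers only.
gt  = {"refrigerator": [2.0, 1.0, 0.8], "couch": [-1.5, 3.0, 0.4]}
est = {"refrigerator": [2.1, 0.9, 0.8], "couch": [-1.3, 3.2, 0.5]}

errors = {name: float(np.linalg.norm(np.subtract(gt[name], est[name])))
          for name in gt}
mean_error = float(np.mean(list(errors.values())))
print(errors, round(mean_error, 3))
```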

5.2 Open-vocabulary vs. closed-vocabulary

For 10 queries (5 within COCO vocabulary, 5 outside it), compare:
| Query | In COCO? | COCO Detector Finds It? | Semantic Splat Finds It? | Splat Confidence | Position Error (m) |
| --- | --- | --- | --- | --- | --- |
| “refrigerator” | Yes | | | | |
| “kitchen counter” | No | N/A | | | |
This comparison is the concrete evidence of the capability shift from closed-vocabulary to open-vocabulary scene understanding.

5.3 Query success rate

Issue 10 natural language navigation queries of varying complexity. For each query, record:
| Query | Intent Parsed? | Valid Location? | Trajectory Reached Goal? | Splat Views Useful? |
| --- | --- | --- | --- | --- |
| “go to the refrigerator” | | | | |
| “explore the kitchen area” | | | | |
Report the overall success rate at each stage of the pipeline.

5.4 Trajectory quality

For the “show me top trajectories to the refrigerator” query, compare the top-3 trajectories:
TrajectoryPath Length (m)Min Clearance (m)Visual Quality (1-5)Notes
#1
#2
#3
“Visual quality” rates how useful the rendered splat views along the trajectory are for previewing the journey (1 = unusable artifacts, 5 = clear and informative).

References

Foundational

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision. ICML 2021.
  • 3D Gaussian Splatting for Real-Time Radiance Field Rendering. SIGGRAPH 2023.
  • LERF: Language Embedded Radiance Fields. ICCV 2023.

Language-embedded Gaussian Splats

  • Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields. CVPR 2024.
  • LangSplat: 3D Language Gaussian Splatting. CVPR 2024.
  • LangSplatV2. NeurIPS 2025.
  • Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding (LEGaussians). CVPR 2024.
  • OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding. NeurIPS 2024.

Re-localization

  • Splat-Loc: 3D Gaussian Splatting Place Recognition and Localization. arxiv.org/abs/2312.02126
  • GS-Loc: Vision Foundation Model-Driven Gaussian Splatting Localization. RAL 2025.

Frameworks and tools

  • gsplat — CUDA-accelerated Gaussian Splatting rasterization library
  • COLMAP — Structure-from-Motion and Multi-View Stereo
  • open_clip — Open-source CLIP implementation
  • HAMMER — Multi-robot collaborative semantic Gaussian Splatting with ROS
  • ROSplat — ROS 2 Jazzy package for online Gaussian Splat visualization

Summary of Deliverables

TaskDeliverables
Path A (Tasks A.1-A.4) or Path B (Tasks B.1-B.4)Trained language-embedded Gaussian Splat of the Gazebo house world, feature extraction pipeline, working text query with heatmap output
Task 3: Semantic Query Interfacegs_semantic_query.py module with SemanticSplatQuery class, query results (position + heatmap + confidence) for 10+ objects
Task 4: LLM Navigation AgentAgent source code with tool definitions, ROS integration, demo video showing at least 3 different semantic navigation queries being executed
Task 5: EvaluationObject localization accuracy table (5 objects), open-vocab vs. closed-vocab comparison (10 queries), query success rate (10 navigation queries), trajectory comparison (top-3), written analysis of strengths and limitations
All outputs (trained models, feature maps, evaluation results) go to /data/results/.