Objectives

  1. Build a semantic spatial graph from robot detection events using PostgreSQL with Apache AGE (graph database extension) and pgvector (vector embeddings)
  2. Perform semantic re-localization - given a robot placed at an unknown pose, use visual similarity (CLIP embeddings) and graph structure to infer where the robot likely is

High-level Workflow

Key Concepts

Keyframe - A camera frame selected for storage. The robot’s camera streams at ~30 fps, but consecutive frames are nearly identical. A keyframe is sampled when the robot has moved a minimum distance (e.g. 0.5 m), rotated a minimum angle, or a time interval has elapsed. Each keyframe captures what the robot sees at a specific moment and location.

Pose - The robot’s position and orientation in the map frame at the moment a keyframe was captured: (map_x, map_y, map_yaw, timestamp). This comes from SLAM or localization against a known map (Nav2 publishes via the map → odom → base_link TF chain). The pose anchors every keyframe to a physical location in the world.

Observation - When YOLOv8 runs on a keyframe and detects an object (e.g. “cup at bbox (120, 80, 250, 310) with 87% confidence”), that detection is an observation. Each keyframe can produce zero or many observations. Each observation also carries a CLIP embedding - a 512-dimensional vector computed from the cropped detection region.

Object (Landmark) - The same physical cup is detected in multiple keyframes as the robot passes it from different angles. Object landmark fusion merges these repeated observations into a single entity using CLIP embedding similarity (cosine similarity above a threshold) combined with spatial proximity (positions close in the map frame). The fused object maintains: class label, mean position, observation count, and first/last seen timestamps.

Place - The map is partitioned into places - spatial clusters of keyframe poses. This can be done with grid binning (e.g. 1 m × 1 m cells) or DBSCAN (density-based clustering). Each keyframe belongs to exactly one place. A place represents a navigable zone such as “kitchen corner” or “hallway segment.”

Run - A single exploration session: one continuous period of the robot navigating. Run A is the mapping run that builds the semantic graph; Run B is the re-localization run that queries it.

Per-Frame Processing Flow

The detector subscribes to two Zenoh topics: camera images (at 10 Hz max) and odometry. For each incoming camera frame:
  1. Rate limit check - if less than 100 ms since the last frame was considered, skip immediately
  2. Cache the latest robot pose from the odometry subscriber (runs independently, updating (x, y, yaw) on every odom message)
  3. Keyframe gate - compare the current pose against the pose of the last accepted keyframe:
    • If the robot has moved less than 0.5 m AND rotated less than 15 degrees, discard the frame (no inference)
    • If either threshold is exceeded, this frame is a keyframe - proceed to step 4
  4. Deserialize the CDR-encoded image and decode it to a numpy array
  5. Run YOLOv8 inference on the frame to produce bounding boxes
  6. For each detection, crop the bounding box region from the frame
  7. Batch all crops through the CLIP encoder (ViT-B/32) to produce 512-dim L2-normalized embeddings
  8. Publish a JSON envelope to tb/detections containing:
    • keyframe_id - monotonically increasing integer
    • timestamp - wall clock time
    • map_x, map_y, map_yaw - robot pose in map frame
    • detections - array of {class, confidence, bbox, embedding, embedding_dim, embedding_model}
When the robot is stationary, no frames pass the keyframe gate. When the robot is moving at typical speeds (~0.2 m/s), a keyframe fires roughly every 2-3 seconds.
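The keyframe gate in step 3 can be sketched in plain Python. The threshold names and the `is_keyframe` function are illustrative, not part of any shipped API; note the yaw difference must be wrapped so that angles near ±π compare correctly:

```python
import math

# Thresholds from the flow above (names are assumptions for this sketch).
MIN_DIST_M = 0.5                      # minimum translation before a new keyframe
MIN_ANGLE_RAD = math.radians(15.0)    # minimum rotation before a new keyframe

def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two yaw angles, in radians."""
    d = (a - b + math.pi) % (2.0 * math.pi) - math.pi
    return abs(d)

def is_keyframe(pose, last_kf_pose) -> bool:
    """Accept the frame if either the distance or the rotation gate trips.

    pose, last_kf_pose: (x, y, yaw) tuples in the map frame.
    """
    if last_kf_pose is None:          # first frame is always a keyframe
        return True
    x, y, yaw = pose
    lx, ly, lyaw = last_kf_pose
    moved = math.hypot(x - lx, y - ly) >= MIN_DIST_M
    turned = angle_diff(yaw, lyaw) >= MIN_ANGLE_RAD
    return moved or turned
```

With these thresholds a stationary robot never trips the gate, matching the behavior described above.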

Property Graph Structure

The entities above are linked in Apache AGE as a property graph:
Run A
 ├── Keyframe_001 (t=1.2s)
 │    ├── Pose (x=2.1, y=0.8, yaw=45°)
 │    ├── Observation → Object "cup_1" → Place "kitchen"
 │    └── Observation → Object "chair_3" → Place "kitchen"
 └── Keyframe_002 (t=3.5s)
      ├── Pose (x=3.4, y=1.2, yaw=90°)
      └── Observation → Object "cup_1" → Place "hallway"
The edges encode the following relationships:
  • Run → Keyframe: this keyframe was captured during this run
  • Keyframe → Pose: the robot was at this map-frame position when the keyframe was captured
  • Keyframe → Observation: this object detection was produced from this keyframe
  • Observation → Object: this observation corresponds to this fused physical landmark
  • Object → Place: this object is located in this place
  • Place → Place (adjacent): these places are directly reachable from each other
In Run B the robot wakes up at an unknown location. It captures a few frames, detects objects, and computes CLIP embeddings for each crop. The re-localization query then:
  1. Runs KNN search in pgvector to find the most visually similar stored embeddings
  2. Follows graph edges from those embeddings to their Objects, then to their Places
  3. Ranks places by aggregated similarity score
  4. Outputs the top-3 candidate places and a pose hypothesis
The graph structure is what converts “I see something that looks like a stored object” into “I am probably in this place.”
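The ranking step above (KNN hits → Places → aggregated score) can be sketched with NumPy. Everything here is an assumption of this sketch: max-aggregation is one reasonable scoring choice, and `rank_places` stands in for the pgvector KNN plus the graph join:

```python
import numpy as np

def rank_places(query_embs, stored_embs, stored_places, top_k=3):
    """Rank places by aggregated cosine similarity to the query crops.

    query_embs:    (Q, D) L2-normalized query crop embeddings
    stored_embs:   (N, D) L2-normalized stored detection embeddings
    stored_places: list of N place ids, one per stored embedding
    """
    sims = query_embs @ stored_embs.T        # (Q, N) cosine similarities
    best_per_store = sims.max(axis=0)        # best query match per stored row
    scores = {}
    for place, s in zip(stored_places, best_per_store):
        # max-aggregate per place; sum or mean are equally defensible choices
        scores[place] = max(scores.get(place, -1.0), float(s))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]
```

In the real pipeline the similarity search runs inside pgvector and the place lookup follows AGE edges; this in-memory version is useful for unit-testing the scoring logic in isolation.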

Background - Semantic + ROS2 Mapping Concepts

For Zenoh background videos, see the Zenoh middleware page.

ROS 2 Mapping and Frame Semantics (map vs odom)

Consult REP 105 - Coordinate Frames for Mobile Platforms and the Nav2 Transforms Setup Guide for the relationship between the map, odom, and base_link frames used in this assignment.

1. Global Pose Requirement

Produce map-frame poses for keyframes using one of:
  • SLAM toolbox mapping mode, or
  • Localization against a known map
For each keyframe store:
  • map_x
  • map_y
  • map_yaw
  • timestamp
This is required for global semantic mapping.

2. Vector Embeddings (pgvector)

You will:
  • Crop each detection bounding box
  • Compute an embedding (CLIP or similar)
  • Store vectors in pgvector
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE detection_embeddings (
  det_pk bigint PRIMARY KEY REFERENCES detections(det_pk),
  model text,
  embedding vector(512)
);
Document your embedding model and dimension.
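pgvector's `<=>` operator computes cosine distance (1 − cosine similarity). A NumPy equivalent is handy for sanity-checking stored embeddings before relying on the database; `knn_cosine` is a name invented for this sketch:

```python
import numpy as np

def knn_cosine(query, stored, k=5):
    """Top-k nearest stored vectors by cosine distance, mirroring the
    ordering produced by pgvector's <=> operator.

    query:  (D,) vector; stored: (N, D) matrix. Inputs need not be normalized.
    Returns (indices, distances), most similar first.
    """
    q = query / np.linalg.norm(query)
    s = stored / np.linalg.norm(stored, axis=1, keepdims=True)
    dist = 1.0 - s @ q                 # cosine distance per stored row
    idx = np.argsort(dist)[:k]         # smallest distance = most similar
    return idx, dist[idx]
```

Because CLIP embeddings are stored L2-normalized, cosine distance and squared Euclidean distance give the same ranking; documenting which operator your queries use is still worthwhile.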

3. Semantic Graph with Apache AGE

Enable AGE (create the extension once per database, then load it in each session):
CREATE EXTENSION IF NOT EXISTS age;
LOAD 'age';
SET search_path = ag_catalog, "$user", public;

Node types

  • Run
  • Keyframe
  • Pose (map frame)
  • Place
  • Object
  • Observation

Edge types

  • Run → Keyframe
  • Keyframe → Pose
  • Keyframe → Observation
  • Observation → Object
  • Object → Place
  • Place → Place (adjacent)

4. Place Construction

Cluster map-frame poses into places using:
  • grid binning, or
  • DBSCAN
Each keyframe must belong to exactly one place.
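For the grid-binning option, the place assignment is a pure function of the pose, which guarantees the one-place-per-keyframe requirement by construction. A minimal sketch (cell size and `place_id` naming are assumptions):

```python
import math

CELL_M = 1.0  # grid cell size; 1 m × 1 m as suggested above

def place_id(map_x: float, map_y: float) -> str:
    """Assign a map-frame pose to a grid-bin place.

    floor() (not int truncation) keeps negative coordinates in the
    correct cell, so every pose maps to exactly one place.
    """
    cx = math.floor(map_x / CELL_M)
    cy = math.floor(map_y / CELL_M)
    return f"place_{cx}_{cy}"
```

DBSCAN instead clusters by pose density, which yields irregularly shaped places but requires a separate rule (e.g. nearest cluster centroid) for keyframes it labels as noise.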

5. Object Landmark Fusion

Merge observations into object landmarks using:
  • spatial distance threshold
  • embedding similarity threshold
Each landmark must maintain:
  • class
  • mean position
  • first_seen / last_seen
  • observation count
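A greedy version of the fusion rule can be sketched as follows. The two thresholds are placeholder values to tune, and the dict-based landmark record is an assumption of this sketch, not a prescribed schema:

```python
import numpy as np

SPATIAL_THRESH_M = 0.75   # assumed spatial gate; tune per environment
SIM_THRESH = 0.85         # assumed cosine-similarity gate (unit-norm embeddings)

def fuse(observation, landmarks):
    """Merge an observation into an existing landmark if it matches in class,
    map-space distance, and embedding similarity; otherwise start a new one.

    observation: dict with 'class', 'pos' (np.array, shape (2,)),
                 'emb' (unit-norm np.array), 'stamp' (float seconds)
    landmarks:   mutable list of dicts with 'class', 'mean_pos', 'emb',
                 'count', 'first_seen', 'last_seen'
    """
    for lm in landmarks:
        same_class = lm["class"] == observation["class"]
        near = np.linalg.norm(lm["mean_pos"] - observation["pos"]) < SPATIAL_THRESH_M
        similar = float(lm["emb"] @ observation["emb"]) > SIM_THRESH
        if same_class and near and similar:
            n = lm["count"]
            # running mean maintains 'mean position' without storing history
            lm["mean_pos"] = (lm["mean_pos"] * n + observation["pos"]) / (n + 1)
            lm["count"] = n + 1
            lm["last_seen"] = observation["stamp"]
            return lm
    lm = {"class": observation["class"], "mean_pos": observation["pos"].copy(),
          "emb": observation["emb"], "count": 1,
          "first_seen": observation["stamp"], "last_seen": observation["stamp"]}
    landmarks.append(lm)
    return lm
```

Greedy first-match fusion is order-dependent; it is adequate for a single mapping run, but a production system would revisit merge decisions periodically.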

6. Semantic Re-Localization Task

Scenario

Run A:
  • Explore maze
  • Build semantic graph
Run B:
  • Restart at unknown pose
  • Capture 1–3 object crops
  • Compute embeddings
  • Run vector KNN search
  • Infer top-3 likely places
  • Output best pose hypothesis

Minimal algorithm

  • Compute embeddings for query crops
  • KNN search in pgvector
  • Join to Object → Place
  • Rank places by similarity score

Required Queries

Provide working scripts for:

Vector

Top-k visually similar detections for a query crop.

Graph

Reachable places within N hops containing an object class.
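In the database this query is written in AGE's Cypher over the Place→Place (adjacent) edges. As a language-agnostic sketch of the same traversal, here is a breadth-first search; the adjacency and per-place object-class maps are illustrative stand-ins for the graph:

```python
from collections import deque

def places_within_hops(start, adjacency, place_objects, target_class, max_hops):
    """BFS over Place→Place (adjacent) edges: return (place, hops) pairs
    within max_hops of `start` whose place contains `target_class`.

    adjacency:     dict place -> list of adjacent places
    place_objects: dict place -> set of object class labels in that place
    """
    seen = {start}
    frontier = deque([(start, 0)])
    hits = []
    while frontier:
        place, hops = frontier.popleft()
        if target_class in place_objects.get(place, set()):
            hits.append((place, hops))
        if hops == max_hops:
            continue                      # do not expand past the hop budget
        for nxt in adjacency.get(place, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return hits
```

BFS visits each place once, so `hops` is the minimum number of adjacency edges from `start` — the same semantics a variable-length Cypher path match with an upper bound expresses.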

Re-localization

Top-3 candidate places from query crops.

Deliverables

  • pgvector + AGE schema
  • Embedding generator
  • Graph builder
  • Semantic relocalizer
  • Demo report with success and failure analysis