Problem Statement
A robot explores an environment (Run A), detecting objects and recording where it saw them. Later (Run B), the robot is placed at an unknown location. It captures a few images, detects objects, and needs to determine where it is using only the visual observations and the semantic map built during Run A. This requires two capabilities:

- A structured representation of what the robot saw and where - the semantic graph
- A query mechanism that matches new observations against stored ones and infers location - re-localization
Architecture
The graph builder runs as a standalone service alongside the detection pipeline. Both subscribe to the same Zenoh topic independently. The det_id field in the Zenoh payload is the join key between the two databases. pgvector stores the embedding vector; AGE stores the graph structure. Re-localization queries both.
Graph Schema
The semantic graph uses six node types connected by seven edge types:

| Node | Key Properties |
|---|---|
| Run | run_id, started_at |
| Keyframe | run_id, keyframe_id, timestamp |
| Pose | map_x, map_y, map_yaw |
| Observation | det_id, class_name, confidence |
| Object | object_id, class_name, mean_x, mean_y, obs_count |
| Place | place_id, centroid_x, centroid_y, keyframe_count |
Each Observation carries a det_id that uniquely identifies it across both pgvector and the graph. This enables the cross-database join needed for re-localization.
Online DBSCAN for Place Construction
As the robot moves, keyframe poses arrive one at a time. These must be clustered into Places (spatial regions like rooms or hallway segments). Standard DBSCAN requires all points upfront. Online DBSCAN processes points incrementally and supports cluster merges - critical in house environments where the robot enters the same room from multiple doorways.

Why online DBSCAN over simpler alternatives
Grid binning (divide the map into fixed cells) is simple but produces arbitrary rectangular places that ignore room boundaries. Leader clustering (assign to nearest centroid within radius) is fast but cannot merge two clusters when a new point bridges them. In a house, the robot might enter the living room from the hallway, then later from the kitchen. Leader clustering creates two separate living room clusters that never merge. Online DBSCAN detects the bridge and unifies them.

Algorithm
For each new keyframe pose (x, y):
- Compute distances to all existing poses and identify neighbors within epsilon (1.5 m), including self
- Update neighbor sets of affected existing points
- Recompute core status: a point is core if it has at least min_samples (3) neighbors including itself
- Determine cluster assignment:
  - If core neighbors belong to one existing cluster, join it
  - If core neighbors span multiple clusters, merge them (lowest place_id survives, all points relabeled)
  - If no neighboring clusters and the point is core, create a new cluster
  - Otherwise the point is noise (may join a cluster later)
- Promote noise neighbors that became border points (neighbor of a newly-core point)
Cluster merge example
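The merge behavior can be sketched in Python. This is a simplified illustration, not the service's implementation: eps and min_samples follow the text, the brute-force neighbor scan would need a spatial index at scale, and promotion of noise neighbors of *existing* points that newly become core is elided.

```python
import math

EPS = 1.5         # neighbor radius in metres (from the text)
MIN_SAMPLES = 3   # core threshold, counting the point itself

class OnlineDBSCAN:
    def __init__(self):
        self.points = []     # (x, y) per point
        self.neighbors = []  # neighbors[i]: indices within EPS, incl. i
        self.labels = []     # cluster id per point, -1 for noise
        self.next_cluster = 0

    def _is_core(self, i):
        return len(self.neighbors[i]) >= MIN_SAMPLES

    def _relabel(self, old, new):
        # Merge: every point of cluster `old` moves to cluster `new`
        for k, lbl in enumerate(self.labels):
            if lbl == old:
                self.labels[k] = new

    def add(self, x, y):
        i = len(self.points)
        self.points.append((x, y))
        nbrs = {i}
        for j, (px, py) in enumerate(self.points[:-1]):
            if math.hypot(x - px, y - py) <= EPS:
                nbrs.add(j)
                self.neighbors[j].add(i)   # update affected existing points
        self.neighbors.append(nbrs)
        self.labels.append(-1)

        # Clusters reachable through core neighbors
        core_clusters = {self.labels[j] for j in nbrs
                         if self.labels[j] != -1 and self._is_core(j)}
        if self._is_core(i):
            if core_clusters:
                cid = min(core_clusters)   # lowest id survives a merge
                for other in core_clusters - {cid}:
                    self._relabel(other, cid)
            else:
                cid = self.next_cluster    # no neighbors: new cluster
                self.next_cluster += 1
            self.labels[i] = cid
            for j in nbrs:                 # promote noise neighbors to border
                if self.labels[j] == -1:
                    self.labels[j] = cid
        elif core_clusters:
            self.labels[i] = min(core_clusters)  # border point
        return self.labels[i]
```

Feeding poses around (0, 0)..(1, 0) and (4, 0)..(5, 0) forms clusters 0 and 1; a bridging pose at (2.5, 0) lies within epsilon of both, so the clusters merge and cluster 0 (the lowest place_id) survives.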
Object Landmark Fusion
The same physical object (e.g. a cup on a table) is detected in multiple keyframes as the robot passes. Landmark fusion merges these repeated observations into a single Object node.

The position problem

The Zenoh payload includes the robot’s map-frame pose but not the object’s world position. Without depth data, the observation position used for fusion is the robot’s pose at the time of detection, not the object’s actual coordinates. This means:

- Multiple objects detected from the same keyframe share the same spatial coordinates
- Spatial distance alone cannot distinguish them
- CLIP embedding cosine similarity is the primary discriminator
- Spatial distance serves as a secondary filter to prevent merging visually similar objects in different rooms
Algorithm
For each observation with class C, robot pose (x, y), and CLIP embedding e:
- Filter existing landmarks to class C
- For each candidate, compute cosine similarity sim = dot(e, mean_embedding) (primary filter, threshold 0.7)
- For candidates passing similarity, compute spatial distance (secondary filter, threshold 3.0 m)
- Best match above both thresholds: merge (update running averages of position, embedding, count)
- No match: create new Object node
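A hedged sketch of the fusion loop. The thresholds come from the text; the Landmark class, the running-average updates, and the unit-normalization of embeddings are assumptions (cosine similarity reduces to a dot product only for normalized vectors).

```python
import numpy as np

SIM_THRESHOLD = 0.7    # cosine similarity, primary filter (from the text)
DIST_THRESHOLD = 3.0   # metres, secondary filter (from the text)

class Landmark:
    """In-memory stand-in for an Object node (illustrative)."""
    def __init__(self, cls, pos, emb):
        self.cls = cls
        self.mean_pos = np.asarray(pos, dtype=float)
        self.mean_emb = emb / np.linalg.norm(emb)
        self.count = 1

    def merge(self, pos, emb):
        # Running averages of position and embedding
        self.count += 1
        self.mean_pos += (np.asarray(pos, dtype=float) - self.mean_pos) / self.count
        self.mean_emb += (emb - self.mean_emb) / self.count
        self.mean_emb /= np.linalg.norm(self.mean_emb)

def fuse(landmarks, cls, pos, emb):
    emb = np.asarray(emb, dtype=float)
    emb = emb / np.linalg.norm(emb)  # cosine sim becomes a plain dot product
    best, best_sim = None, SIM_THRESHOLD
    for lm in landmarks:
        if lm.cls != cls:
            continue                                   # class filter
        sim = float(np.dot(emb, lm.mean_emb))          # primary: appearance
        dist = float(np.linalg.norm(lm.mean_pos - np.asarray(pos, dtype=float)))
        if sim >= best_sim and dist <= DIST_THRESHOLD:
            best, best_sim = lm, sim
    if best is None:
        best = Landmark(cls, pos, emb)                 # no match: new Object
        landmarks.append(best)
    else:
        best.merge(pos, emb)                           # match: merge in place
    return best
```

Two nearby, visually similar "cup" observations merge into one landmark; an equally similar cup 10 m away fails the spatial filter and becomes a second Object, which is exactly the two-rooms case the secondary filter exists for.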
Place Adjacency
Two places are adjacent if the robot drove between them. The graph builder tracks last_keyframe_place and creates an ADJACENT_TO edge (with a shared_transitions count) whenever consecutive keyframes belong to different places. On cluster merge, adjacency edges from the absorbed place transfer to the survivor.
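A minimal sketch of this bookkeeping, assuming integer place IDs; the class and method names are illustrative, not the service's API.

```python
from collections import defaultdict

class AdjacencyTracker:
    """Tracks ADJACENT_TO transition counts between places (illustrative)."""
    def __init__(self):
        self.last_place = None                # last_keyframe_place
        self.transitions = defaultdict(int)   # {frozenset({a, b}): count}

    def on_keyframe(self, place_id):
        # Consecutive keyframes in different places => one shared transition
        if self.last_place is not None and place_id != self.last_place:
            self.transitions[frozenset((self.last_place, place_id))] += 1
        self.last_place = place_id

    def on_merge(self, absorbed, survivor):
        # Transfer adjacency edges from the absorbed place to the survivor
        for pair in list(self.transitions):
            if absorbed in pair:
                count = self.transitions.pop(pair)
                other = (set(pair) - {absorbed}).pop()
                if other != survivor:   # drop would-be self-edges
                    self.transitions[frozenset((other, survivor))] += count
        if self.last_place == absorbed:
            self.last_place = survivor
```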
Identity and Cross-Database Joins
Every entity has a globally unique identifier that survives service restarts:

| Entity | ID Format | Example |
|---|---|---|
| Run | hostname + timestamp | robot-20260324T130000 |
| Keyframe | (run_id, keyframe_id) | scoped to run |
| Observation | {run_id}_kf{kf_id}_d{idx} | robot-20260324T130000_kf42_d0 |
| Object | UUID | obj-a1b2c3 |
| Place | {run_id}_p{counter} | robot-20260324T130000_p3 |
The det_id on each Observation is stored in both pgvector (detection_embeddings.det_id) and AGE (as an Observation node property). This is the join key for re-localization queries.
Semantic Re-Localization
Given a robot at an unknown location with a few object observations:

Step 1: Vector search
Compute CLIP embeddings for the query crops and find the top-k most visually similar stored detections.

Step 2: Graph traversal
For each matched det_id, traverse the graph to find which Place the object is in.
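Hedged sketches of both queries. The detection_embeddings table, det_id column, and node labels come from the text; the embedding column name, the graph name semantic_graph, and the edge labels OF_OBJECT and IN_PLACE are assumptions (the text counts seven edge types but does not name them).

```sql
-- Step 1 (pgvector): top-k visually similar detections; <=> is pgvector's
-- cosine distance operator, so similarity = 1 - distance.
SELECT det_id, 1 - (embedding <=> $1::vector) AS similarity
FROM detection_embeddings
ORDER BY embedding <=> $1::vector
LIMIT 5;

-- Step 2 (AGE): which Place contains the matched observation's object?
SELECT * FROM cypher('semantic_graph', $$
    MATCH (obs:Observation {det_id: 'robot-20260324T130000_kf42_d0'})
          -[:OF_OBJECT]->(o:Object)-[:IN_PLACE]->(p:Place)
    RETURN p.place_id, p.centroid_x, p.centroid_y
$$) AS (place_id agtype, centroid_x agtype, centroid_y agtype);
```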
Step 3: Aggregate and rank
Group results by Place, sum similarity scores, rank top-3. The winning Place’s centroid is the pose hypothesis.

Why two databases
pgvector handles the high-dimensional KNN search (optimized vector index). AGE handles the graph traversal (Observation to Object to Place). Combining both in a single query would require scanning embeddings inside Cypher, which AGE is not optimized for. The det_id join bridges the two with minimal overhead.
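The step-3 aggregation is small enough to sketch directly (the function name and input shape are illustrative):

```python
from collections import defaultdict

def rank_places(matches, top_n=3):
    """matches: (place_id, similarity) pairs joined from steps 1 and 2."""
    scores = defaultdict(float)
    for place_id, sim in matches:
        scores[place_id] += sim   # sum similarity per Place
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
```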
Bootstrap and Idempotency
All clustering and fusion state lives in memory. On service restart, the graph builder reconstructs its state from the existing AGE graph:

- Query all Keyframe/Pose nodes for the current run, ordered by keyframe_id
- Replay poses through online DBSCAN (without writing, since the graph already has the data)
- Query all Object nodes to rebuild the landmark index
- Record the highest keyframe_id seen and only process newer messages
Duplicate det_id inserts in pgvector are rejected by the UNIQUE constraint, which keeps message replay idempotent.
Docker Services
The full pipeline requires these services:

| Service | Role |
|---|---|
| demo-world-house | Gazebo house simulation with Nav2 |
| zenoh-router | Zenoh message bus |
| zenoh-bridge | DDS to Zenoh bridge |
| detector | YOLOv8 + CLIP keyframe detector (GPU) |
| embedding-ingest | Writes embeddings to pgvector |
| graph-builder | Writes property graph to AGE |
| vector | PostgreSQL + pgvector (port 5436) |
| age | PostgreSQL + Apache AGE (port 5435) |
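A hedged docker-compose fragment for the two database services; the image tags and credentials are assumptions, while the host ports follow the table.

```yaml
services:
  vector:
    image: pgvector/pgvector:pg16   # assumed image tag
    environment:
      POSTGRES_PASSWORD: example    # placeholder credential
    ports:
      - "5436:5432"                 # host port from the table
  age:
    image: apache/age               # assumed image tag
    environment:
      POSTGRES_PASSWORD: example
    ports:
      - "5435:5432"
```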

