This page describes the methodology for building a semantic spatial graph from robot detection events and using it for visual re-localization. The graph is constructed incrementally as the robot explores, stored in Apache AGE (a PostgreSQL graph extension), and queried jointly with pgvector embeddings to infer the robot’s location from visual observations alone. The reference implementation is in the turtlebot-maze repository. This page builds on the Object Detection page, which covers the upstream detection and embedding pipeline.

Problem Statement

A robot explores an environment (Run A), detecting objects and recording where it saw them. Later (Run B), the robot is placed at an unknown location. It captures a few images, detects objects, and needs to determine where it is using only the visual observations and the semantic map built during Run A. This requires two capabilities:
  1. A structured representation of what the robot saw and where - the semantic graph
  2. A query mechanism that matches new observations against stored ones and infers location - re-localization

Architecture

The graph builder runs as a standalone service alongside the detection pipeline. Both subscribe to the same Zenoh topic independently. The det_id field in the Zenoh payload is the join key between the two databases. pgvector stores the embedding vector; AGE stores the graph structure. Re-localization queries both.
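As a concrete illustration, the per-detection payload might look like the sketch below. The field names other than det_id, class_name, confidence, and the map-frame pose are hypothetical; only det_id's role as the cross-database join key is fixed by the design.

```python
# Hypothetical shape of one detection message on the shared Zenoh topic.
# Both the embedding-ingest and graph-builder services consume it; they
# key their rows on det_id so pgvector and AGE rows can be joined later.
detection_msg = {
    "det_id": "robot-20260324T130000_kf42_d0",   # join key across databases
    "run_id": "robot-20260324T130000",
    "keyframe_id": 42,
    "class_name": "cup",
    "confidence": 0.87,
    "pose": {"map_x": 1.2, "map_y": -0.4, "map_yaw": 0.0},  # robot pose, not object pose
    "embedding": [0.0] * 512,                    # CLIP vector, stored in pgvector
}

def join_key(msg: dict) -> str:
    """Both consumers identify a detection by the same det_id."""
    return msg["det_id"]
```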

Graph Schema

The semantic graph uses six node types connected by seven edge types:
Node         Key Properties
Run          run_id, started_at
Keyframe     run_id, keyframe_id, timestamp
Pose         map_x, map_y, map_yaw
Observation  det_id, class_name, confidence
Object       object_id, class_name, mean_x, mean_y, obs_count
Place        place_id, centroid_x, centroid_y, keyframe_count
Each Observation carries a det_id that uniquely identifies it across both pgvector and the graph. This enables the cross-database join needed for re-localization.

Online DBSCAN for Place Construction

As the robot moves, keyframe poses arrive one at a time. These must be clustered into Places (spatial regions like rooms or hallway segments). Standard DBSCAN requires all points upfront. Online DBSCAN processes points incrementally and supports cluster merges - critical in house environments where the robot enters the same room from multiple doorways.

Why online DBSCAN over simpler alternatives

Grid binning (divide the map into fixed cells) is simple but produces arbitrary rectangular places that ignore room boundaries. Leader clustering (assign to nearest centroid within radius) is fast but cannot merge two clusters when a new point bridges them. In a house, the robot might enter the living room from the hallway, then later from the kitchen. Leader clustering creates two separate living room clusters that never merge. Online DBSCAN detects the bridge and unifies them.

Algorithm

For each new keyframe pose (x, y):
  1. Compute distances to all existing poses and identify neighbors within epsilon (1.5 m), including self
  2. Update neighbor sets of affected existing points
  3. Recompute core status: a point is core if it has at least min_samples (3) neighbors including itself
  4. Determine cluster assignment:
    • If core neighbors belong to one existing cluster, join it
    • If core neighbors span multiple clusters, merge them (lowest place_id survives, all points relabeled)
    • If no neighboring clusters and the point is core, create a new cluster
    • Otherwise the point is noise (may join a cluster later)
  5. Promote noise neighbors that became border points (neighbor of a newly-core point)
Place centroids are maintained as running averages. On merge, centroids are recomputed from the combined member sets.
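The steps above can be sketched as a minimal in-memory clusterer (a simplified sketch: the real builder also writes Place nodes and centroids to AGE, which this version omits):

```python
import math

EPS, MIN_SAMPLES = 1.5, 3  # epsilon and min_samples values from the text

class OnlineDBSCAN:
    """Minimal sketch of the incremental DBSCAN described above."""

    def __init__(self):
        self.points = []     # (x, y) per point index
        self.neighbors = []  # neighbor index sets, including self
        self.label = []      # cluster (place) id, or None for noise
        self.next_id = 0

    def _core(self, i):
        return len(self.neighbors[i]) >= MIN_SAMPLES

    def add(self, x, y):
        i = len(self.points)
        self.points.append((x, y))
        nbrs = {i}
        # Steps 1-2: find neighbors within EPS; update affected points.
        for j, (px, py) in enumerate(self.points[:-1]):
            if math.hypot(x - px, y - py) <= EPS:
                nbrs.add(j)
                self.neighbors[j].add(i)
        self.neighbors.append(nbrs)
        self.label.append(None)
        # Steps 3-4: clusters owned by core neighbors decide the assignment.
        core_clusters = {self.label[j] for j in nbrs
                         if self._core(j) and self.label[j] is not None}
        if core_clusters:
            target = min(core_clusters)      # lowest place_id survives a merge
            for k in range(len(self.label)):
                if self.label[k] in core_clusters:
                    self.label[k] = target
        elif self._core(i):
            target = self.next_id            # new cluster
            self.next_id += 1
        else:
            return None                      # noise; may join a cluster later
        self.label[i] = target
        # Step 5: a newly core point absorbs its noise neighbors as borders.
        if self._core(i):
            for j in nbrs:
                if self.label[j] is None:
                    self.label[j] = target
        return target
```

Feeding two separated runs of poses and then a bridge point between them exercises the merge path: both clusters collapse into the one with the lower id.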

Cluster merge example

Before:  Place A (kitchen entry)    Place B (hallway entry)
           *  *  *                     *  *  *

After bridge point arrives:
           *  *  *  ---- * ----  *  *  *
                    (bridge)
           All points now in Place A (survivor)

Object Landmark Fusion

The same physical object (e.g. a cup on a table) is detected in multiple keyframes as the robot passes. Landmark fusion merges these repeated observations into a single Object node.

The position problem

The Zenoh payload includes the robot’s map-frame pose but not the object’s world position. Without depth data, the observation position used for fusion is the robot’s pose at the time of detection, not the object’s actual coordinates. This means:
  • Multiple objects detected from the same keyframe share the same spatial coordinates
  • Spatial distance alone cannot distinguish them
  • CLIP embedding cosine similarity is the primary discriminator
  • Spatial distance serves as a secondary filter to prevent merging visually similar objects in different rooms

Algorithm

For each observation with class C, robot pose (x, y), and CLIP embedding e:
  1. Filter existing landmarks to class C
  2. For each candidate, compute cosine similarity sim = dot(e, mean_embedding) (primary filter, threshold 0.7)
  3. For candidates passing similarity, compute spatial distance (secondary filter, threshold 3.0 m)
  4. Best match above both thresholds: merge (update running averages of position, embedding, count)
  5. No match: create new Object node
The landmark’s mean position converges toward the centroid of robot poses from which the object was observed. For precise object localization, depth integration would be needed.
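A compact sketch of the fusion loop, assuming an illustrative in-memory landmark list of dicts (the real builder mirrors each entry as an Object node in AGE) and unit-normalized CLIP embeddings, so the dot product equals cosine similarity:

```python
import math

SIM_THRESHOLD, DIST_THRESHOLD = 0.7, 3.0  # thresholds from the text

def fuse(landmarks, class_name, pose, embedding):
    """Merge one observation into `landmarks` or append a new entry."""
    best, best_sim = None, SIM_THRESHOLD
    for lm in landmarks:
        if lm["class_name"] != class_name:                    # step 1: class filter
            continue
        sim = sum(a * b for a, b in zip(embedding, lm["mean_embedding"]))
        if sim < best_sim:                                    # step 2: primary filter
            continue
        dist = math.hypot(pose[0] - lm["mean_x"], pose[1] - lm["mean_y"])
        if dist > DIST_THRESHOLD:                             # step 3: secondary filter
            continue
        best, best_sim = lm, sim
    if best is None:                                          # step 5: new Object
        lm = {"class_name": class_name, "mean_x": pose[0], "mean_y": pose[1],
              "mean_embedding": list(embedding), "obs_count": 1}
        landmarks.append(lm)
        return lm
    n = best["obs_count"]                                     # step 4: running averages
    best["mean_x"] = (best["mean_x"] * n + pose[0]) / (n + 1)
    best["mean_y"] = (best["mean_y"] * n + pose[1]) / (n + 1)
    best["mean_embedding"] = [(m * n + e) / (n + 1)
                              for m, e in zip(best["mean_embedding"], embedding)]
    best["obs_count"] = n + 1
    return best
```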

Place Adjacency

Two places are adjacent if the robot drove between them. The graph builder tracks last_keyframe_place and creates an ADJACENT_TO edge (with shared_transitions count) whenever consecutive keyframes belong to different places. On cluster merge, adjacency edges from the absorbed place transfer to the survivor.
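The bookkeeping reduces to a few lines; this sketch tracks the edge counts in a plain dict keyed by an unordered place pair, standing in for the ADJACENT_TO edges and their shared_transitions property:

```python
def update_adjacency(adjacency, last_place, new_place):
    """Bump the transition count when consecutive keyframes change place.

    Returns the place id to remember as last_keyframe_place.
    """
    if last_place is not None and new_place != last_place:
        key = tuple(sorted((last_place, new_place)))          # unordered pair
        adjacency[key] = adjacency.get(key, 0) + 1            # shared_transitions
    return new_place
```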

Identity and Cross-Database Joins

Every entity has a globally unique identifier that survives service restarts:
Entity       ID Format                    Example
Run          hostname + timestamp         robot-20260324T130000
Keyframe     (run_id, keyframe_id)        scoped to run
Observation  {run_id}_kf{kf_id}_d{idx}    robot-20260324T130000_kf42_d0
Object       UUID                         obj-a1b2c3
Place        {run_id}_p{counter}          robot-20260324T130000_p3
The det_id on each Observation is stored in both pgvector (detection_embeddings.det_id) and AGE (Observation node property). This is the join key for re-localization queries.

Semantic Re-Localization

Given a robot at an unknown location and a handful of object observations, re-localization proceeds in three steps.

Step 1: Vector similarity search

Compute CLIP embeddings for the query crops and find the top-k most visually similar stored detections:
SELECT det_id, embedding <=> $query_vec AS distance
FROM detection_embeddings
ORDER BY distance
LIMIT 10;

Step 2: Graph traversal

For each matched det_id, traverse the graph to find which Place the object is in:
SELECT * FROM cypher('semantic_map', $$
  MATCH (obs:Observation {det_id: 'robot-20260324T130000_kf42_d0'})
        -[:OBSERVES]->(obj:Object)
        -[:LOCATED_IN]->(p:Place)
  RETURN p.place_id, p.centroid_x, p.centroid_y, obj.class_name
$$) AS (place_id agtype, cx agtype, cy agtype, class agtype);

Step 3: Aggregate and rank

Group results by Place, sum similarity scores, rank top-3. The winning Place’s centroid is the pose hypothesis.
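The aggregation step can be sketched as follows, assuming each matched detection has already been resolved to (place_id, centroid, similarity) via the two queries, with similarity taken here as 1 minus the cosine distance pgvector returned:

```python
from collections import defaultdict

def rank_places(matches, top_n=3):
    """Group matches by place, sum similarity, return the top places.

    `matches` is a list of (place_id, (cx, cy), similarity) tuples; the
    winning place's centroid serves as the pose hypothesis.
    """
    scores = defaultdict(float)
    centroids = {}
    for place_id, centroid, sim in matches:
        scores[place_id] += sim
        centroids[place_id] = centroid
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [(p, scores[p], centroids[p]) for p in ranked]
```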

Why two databases

pgvector handles the high-dimensional KNN search (optimized vector index). AGE handles the graph traversal (Observation to Object to Place). Combining both in a single query would require scanning embeddings inside Cypher, which AGE is not optimized for. The det_id join bridges the two with minimal overhead.

Bootstrap and Idempotency

All clustering and fusion state lives in memory. On service restart, the graph builder reconstructs its state from the existing AGE graph:
  1. Query all Keyframe/Pose nodes for the current run, ordered by keyframe_id
  2. Replay poses through online DBSCAN (without writing, since the graph already has the data)
  3. Query all Object nodes to rebuild the landmark index
  4. Record the highest keyframe_id seen and only process newer messages
Idempotency is enforced by checking for existing nodes before creation (MERGE in Cypher). Duplicate det_id inserts in pgvector are rejected by the UNIQUE constraint.
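The replay logic might look like this sketch, where `keyframe_rows` and `object_rows` stand in for the AGE query results and `clusterer`/`landmark_index` are the in-memory structures being rebuilt (interfaces here are illustrative):

```python
def bootstrap_state(keyframe_rows, object_rows, clusterer, landmark_index):
    """Rebuild in-memory state from graph query results after a restart.

    keyframe_rows: iterable of (keyframe_id, x, y) for the current run
    object_rows:   iterable of stored Object records
    Returns the highest keyframe_id seen, so only newer messages are processed.
    """
    last_kf = -1
    for kf_id, x, y in sorted(keyframe_rows):  # ordered by keyframe_id
        clusterer.add(x, y)                    # replay only; no graph writes
        last_kf = max(last_kf, kf_id)
    for obj in object_rows:                    # rebuild the landmark index
        landmark_index.append(obj)
    return last_kf
```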

Docker Services

The full pipeline requires these services:
Service           Role
demo-world-house  Gazebo house simulation with Nav2
zenoh-router      Zenoh message bus
zenoh-bridge      DDS-to-Zenoh bridge
detector          YOLOv8 + CLIP keyframe detector (GPU)
embedding-ingest  Writes embeddings to pgvector
graph-builder     Writes property graph to AGE
vector            PostgreSQL + pgvector (port 5436)
age               PostgreSQL + Apache AGE (port 5435)
docker compose up -d \
  demo-world-house \
  zenoh-router zenoh-bridge \
  detector embedding-ingest graph-builder \
  vector age