Objectives
- Build a semantic spatial graph from robot detection events using PostgreSQL with Apache AGE (graph database extension) and pgvector (vector embeddings)
- Perform semantic re-localization - given a robot placed at an unknown pose, use visual similarity (CLIP embeddings) and graph structure to infer where the robot likely is
High-level Workflow
Key Concepts
Keyframe - A camera frame selected for storage. The robot’s camera streams at ~30 fps, but consecutive frames are nearly identical. A keyframe is sampled when the robot has moved a minimum distance (e.g. 0.5 m), rotated a minimum angle, or a time interval has elapsed. Each keyframe captures what the robot sees at a specific moment and location.
Pose - The robot’s position and orientation in the map frame at the moment a keyframe was captured: (map_x, map_y, map_yaw, timestamp). This comes from SLAM or localization against a known map (Nav2 publishes via the map → odom → base_link TF chain). The pose anchors every keyframe to a physical location in the world.
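The keyframe sampling rule can be sketched as a simple predicate. This is a minimal illustration, not a required implementation; the thresholds are the example values from above and the function name is ours.

```python
import math

def is_keyframe(pose, last_pose, last_time, now,
                min_dist=0.5, min_angle_deg=15.0, max_interval_s=5.0):
    """Decide whether the current frame should become a keyframe.

    pose / last_pose are (x, y, yaw) tuples in the map frame, yaw in radians.
    Default thresholds are illustrative, not mandated by the assignment.
    """
    if last_pose is None:  # the very first frame is always a keyframe
        return True
    dist = math.hypot(pose[0] - last_pose[0], pose[1] - last_pose[1])
    # wrap the yaw difference into [-pi, pi] before comparing
    dyaw = (pose[2] - last_pose[2] + math.pi) % (2 * math.pi) - math.pi
    return (dist >= min_dist
            or abs(dyaw) >= math.radians(min_angle_deg)
            or (now - last_time) >= max_interval_s)
```

Any one trigger (distance, rotation, or elapsed time) is enough to sample a keyframe; otherwise the frame is dropped before inference.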
Observation - When YOLOv8 runs on a keyframe and detects an object (e.g. “cup at bbox with 87% confidence”), that detection is an observation. Each keyframe can produce zero or many observations. Each observation also carries a CLIP embedding - a 512-dimensional vector computed from the cropped detection region.
Object (Landmark) - The same physical cup is detected in multiple keyframes as the robot passes it from different angles. Object landmark fusion merges these repeated observations into a single entity using CLIP embedding similarity (cosine distance above a threshold) combined with spatial proximity (positions close in map frame). The fused object maintains: class label, mean position, observation count, and first/last seen timestamps.
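A minimal sketch of the fusion rule, assuming L2-normalized embeddings (so cosine similarity is a dot product) and a plain-dict landmark record; the function names, dict layout, and threshold values are illustrative:

```python
import math

def cosine_sim(a, b):
    """Dot product; valid as cosine similarity because embeddings are L2-normalized."""
    return sum(x * y for x, y in zip(a, b))

def fuse_observation(obs, landmarks, sim_thresh=0.85, dist_thresh=1.0):
    """Attach an observation to an existing landmark or create a new one.

    obs: {'class', 'embedding', 'x', 'y'} in the map frame.
    Thresholds and record layout are illustrative assumptions.
    """
    for lm in landmarks:
        close = math.hypot(obs['x'] - lm['x'], obs['y'] - lm['y']) <= dist_thresh
        similar = cosine_sim(obs['embedding'], lm['embedding']) >= sim_thresh
        if lm['class'] == obs['class'] and close and similar:
            n = lm['count']
            # incremental mean position over all fused observations
            lm['x'] = (lm['x'] * n + obs['x']) / (n + 1)
            lm['y'] = (lm['y'] * n + obs['y']) / (n + 1)
            lm['count'] = n + 1
            return lm
    lm = {'class': obs['class'], 'embedding': obs['embedding'],
          'x': obs['x'], 'y': obs['y'], 'count': 1}
    landmarks.append(lm)
    return lm
```

A real implementation would also track first/last seen timestamps and could update the stored embedding (e.g. a running mean), which this sketch omits.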
Place - The map is partitioned into places - spatial clusters of keyframe poses. This can be done with grid binning (e.g. 1 m × 1 m cells) or DBSCAN (density-based clustering). Each keyframe belongs to exactly one place. A place represents a navigable zone such as “kitchen corner” or “hallway segment.”
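For the grid-binning variant, assigning a pose to a place and testing place adjacency is a few lines. A sketch with an illustrative 1 m cell size (function names are ours):

```python
import math

def place_id(x, y, cell=1.0):
    """Map a map-frame position to a grid-cell place id (cell size illustrative)."""
    return (math.floor(x / cell), math.floor(y / cell))

def adjacent(p, q):
    """Two distinct grid places are adjacent if their cells touch (8-connectivity)."""
    return p != q and abs(p[0] - q[0]) <= 1 and abs(p[1] - q[1]) <= 1
```

The adjacency predicate is one simple way to derive the Place → Place (adjacent) edges described later; DBSCAN clusters would need an adjacency definition of their own (e.g. centroid distance).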
Run - A single exploration session: one continuous period of the robot navigating. Run A is the mapping run that builds the semantic graph, Run B is the re-localization run that queries it.
Per-Frame Processing Flow
The detector subscribes to two Zenoh topics: camera images (at 10 Hz max) and odometry. For each incoming camera frame:
1. Rate limit check - if less than 100 ms has elapsed since the last frame was considered, skip immediately
2. Cache the latest robot pose from the odometry subscriber (which runs independently, updating (x, y, yaw) on every odom message)
3. Keyframe gate - compare the current pose against the pose of the last accepted keyframe:
   - If the robot has moved less than 0.5 m AND rotated less than 15 degrees, discard the frame (no inference)
   - If either threshold is exceeded, this frame is a keyframe - proceed to step 4
4. Deserialize the CDR-encoded image and decode it to a numpy array
5. Run YOLOv8 inference on the frame to produce bounding boxes
6. For each detection, crop the bounding box region from the frame
7. Batch all crops through the CLIP encoder (ViT-B/32) to produce 512-dim L2-normalized embeddings
8. Publish a JSON envelope to tb/detections containing:
   - keyframe_id - monotonically increasing integer
   - timestamp - wall clock time
   - map_x, map_y, map_yaw - robot pose in map frame
   - detections - array of {class, confidence, bbox, embedding, embedding_dim, embedding_model}
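The final publish step assembles the envelope from the cached pose and the per-detection results. A sketch of that assembly (the helper name is ours; only the field names come from the spec above):

```python
import json
import time

def build_envelope(keyframe_id, pose, detections):
    """Serialize the tb/detections payload for one keyframe.

    pose is (map_x, map_y, map_yaw); each entry in detections is a dict with
    the fields listed in the spec. Illustrative helper, not the required API.
    """
    return json.dumps({
        'keyframe_id': keyframe_id,          # monotonically increasing integer
        'timestamp': time.time(),            # wall clock time
        'map_x': pose[0],
        'map_y': pose[1],
        'map_yaw': pose[2],
        'detections': detections,
    })
```

Keeping the embedding inside each detection entry (rather than in a side channel) means a single subscriber on tb/detections has everything needed to build graph nodes and pgvector rows.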
Property Graph Structure
The entities above are linked in Apache AGE as a property graph:
- Run → Keyframe: this keyframe was captured during this run
- Keyframe → Pose: the robot was at this map-frame position when the keyframe was captured
- Keyframe → Observation: this object detection was produced from this keyframe
- Observation → Object: this observation corresponds to this fused physical landmark
- Object → Place: this object is located in this place
- Place → Place (adjacent): these places are directly reachable from each other
At re-localization time (Run B), the query pipeline:
- Runs KNN search in pgvector to find the most visually similar stored embeddings
- Follows graph edges from those embeddings to their Objects, then to their Places
- Ranks places by aggregated similarity score
- Outputs the top-3 candidate places and a pose hypothesis
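The ranking step can be sketched as a small aggregation. Assumed input: (place_id, similarity) pairs obtained by following the Observation → Object → Place edges from the KNN hits; summing similarities per place is one simple aggregation choice, not the only valid one.

```python
from collections import defaultdict

def rank_places(knn_hits, top_k=3):
    """Aggregate per-hit similarities into ranked place candidates.

    knn_hits: iterable of (place_id, similarity) pairs. Returns up to top_k
    (place_id, score) pairs, best first. Sum-aggregation is an assumption.
    """
    scores = defaultdict(float)
    for place, sim in knn_hits:
        scores[place] += sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

A pose hypothesis can then be taken from the top-ranked place, e.g. the centroid of its keyframe poses.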
Background - Semantic + ROS2 Mapping Concepts
For Zenoh background videos, see the Zenoh middleware page.

ROS 2 Mapping and Frame Semantics (map vs odom)
Consult REP 105 - Coordinate Frames for Mobile Platforms and the Nav2 Transforms Setup Guide for the relationship between the map, odom, and base_link frames used in this assignment.
1. Global Pose Requirement
Produce map-frame poses for keyframes using one of:
- SLAM Toolbox mapping mode, or
- Localization against a known map

Each keyframe pose record contains:
- map_x
- map_y
- map_yaw
- timestamp
2. Vector Embeddings (pgvector)
You will:
- Crop each detection bounding box
- Compute an embedding (CLIP or similar)
- Store vectors in pgvector
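Two small details worth getting right before storage: embeddings should be L2-normalized (so cosine similarity reduces to a dot product), and pgvector accepts vectors as a bracketed text literal like '[0.1,0.2]'. A sketch, with helper names of our choosing:

```python
import math

def l2_normalize(vec):
    """L2-normalize an embedding; leaves an all-zero vector unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

def to_pgvector_literal(vec):
    """Serialize a vector into pgvector's text input format, e.g. '[0.5,0.25]'."""
    return '[' + ','.join(repr(float(v)) for v in vec) + ']'
```

The literal can then be passed as an ordinary query parameter to an INSERT into a vector(512) column.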
3. Semantic Graph with Apache AGE
Enable AGE in each database session before issuing graph queries (load the extension with LOAD 'age' and add ag_catalog to the search_path).

Node types
- Run
- Keyframe
- Pose (map frame)
- Place
- Object
- Observation
Edge types
- Run → Keyframe
- Keyframe → Pose
- Keyframe → Observation
- Observation → Object
- Object → Place
- Place → Place (adjacent)
4. Place Construction
Cluster map-frame poses into places using:
- grid binning, or
- DBSCAN
5. Object Landmark Fusion
Merge observations into object landmarks using:
- spatial distance threshold
- embedding similarity threshold

Each fused landmark maintains:
- class
- mean position
- first_seen / last_seen
- observation count
6. Semantic Re-Localization Task
Scenario
Run A:
- Explore maze
- Build semantic graph

Run B:
- Restart at unknown pose
- Capture 1–3 object crops
- Compute embeddings
- Run vector KNN search
- Infer top-3 likely places
- Output best pose hypothesis
Minimal algorithm
- Compute embeddings for query crops
- KNN search in pgvector
- Join to Object → Place
- Rank places by similarity score
Required Queries
Provide working scripts for:

Vector
Top-k visually similar detections for a query crop.

Graph
Reachable places within N hops containing an object class.

Re-localization
Top-3 candidate places from query crops.

Deliverables
- pgvector + AGE schema
- Embedding generator
- Graph builder
- Semantic relocalizer
- Demo report with success and failure analysis

