Architecture
The detection pipeline sits outside the ROS 2 container. It subscribes to camera images and odometry via Zenoh, runs inference on the GPU, and publishes enriched detections back to Zenoh. A separate ingest worker writes embeddings to pgvector.
CDR Deserialization
The zenoh-bridge-ros2dds forwards raw CDR (Common Data Representation) bytes from DDS to Zenoh without re-encoding. CDR is the binary serialization format used by DDS, the middleware layer beneath ROS 2. The detector uses the pycdr2 Python library to deserialize these bytes directly into Python dataclasses - no ROS 2 installation required.
Each ROS 2 message type maps to a pycdr2 dataclass:
Keyframe Gating
Not every camera frame warrants running inference. The camera streams at ~30 fps, but consecutive frames from a slow-moving robot are nearly identical. Keyframe gating selects frames worth processing based on how much the robot has moved since the last processed frame.
Per-frame decision flow
- A camera frame arrives (rate-limited to 10 Hz)
- The detector checks the latest cached pose from the odometry subscriber
- It computes the distance and angle change since the last accepted keyframe
- If the robot has moved less than 0.5 m AND rotated less than 15 degrees, the frame is discarded - no inference runs
- If either threshold is exceeded, this frame becomes a keyframe - YOLO and CLIP run
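The decision flow above can be sketched in a few lines of plain Python. The thresholds come from the text; the `(x, y, yaw)` pose tuples and the `yaw_from_quat` helper are illustrative, not the detector's actual interface.

```python
import math

# Thresholds from the gating rules above.
DIST_THRESH_M = 0.5
ANGLE_THRESH_RAD = math.radians(15.0)


def yaw_from_quat(x: float, y: float, z: float, w: float) -> float:
    """Yaw (rotation about Z) of a unit quaternion, e.g. from nav_msgs/Odometry."""
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))


def is_keyframe(pose, last_pose) -> bool:
    """pose = (x, y, yaw). Accept the frame if the robot moved or turned enough."""
    if last_pose is None:
        return True  # first frame is always a keyframe
    dist = math.hypot(pose[0] - last_pose[0], pose[1] - last_pose[1])
    # Wrap the heading difference into [-pi, pi] before comparing.
    dyaw = abs(math.atan2(math.sin(pose[2] - last_pose[2]),
                          math.cos(pose[2] - last_pose[2])))
    return dist >= DIST_THRESH_M or dyaw >= ANGLE_THRESH_RAD
```

Frames that fail both checks are dropped before any GPU work is scheduled, which is what keeps inference load decoupled from the camera's 30 fps rate.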
Detection and Embedding Extraction
Each keyframe goes through two models in sequence.
YOLOv8 (detection)
YOLOv8 nano runs on the full frame and produces bounding boxes with class labels and confidence scores. The model is pretrained on COCO’s 80 object classes.
CLIP (embedding)
For each detection, the bounding box region is cropped from the original frame and passed through a CLIP image encoder (ViT-B/32, pretrained on LAION-2B). This produces a 512-dimensional embedding vector that captures the visual semantics of the detected object.
Why CLIP instead of YOLOv8 backbone features
YOLOv8’s backbone produces spatial feature maps optimized for detection, not for cross-instance visual similarity. CLIP embeddings are trained on image-text pairs across millions of concepts, making them effective for:
- Matching the same physical object seen from different angles
- Comparing objects across different runs (Run A mapping vs Run B re-localization)
- KNN search in pgvector for semantic re-localization
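A minimal crop-and-embed step using the open_clip library might look like the following. The `laion2b_s34b_b79k` weight tag is an assumption about which LAION-2B checkpoint is used, and the function signatures are illustrative rather than the detector's actual code.

```python
# Sketch: embed one detection crop with open_clip (ViT-B/32 on LAION-2B).
import torch
import open_clip
from PIL import Image


def load_clip():
    # Downloads pretrained weights on first use; the tag is an assumed
    # LAION-2B checkpoint name.
    model, _, preprocess = open_clip.create_model_and_transforms(
        "ViT-B-32", pretrained="laion2b_s34b_b79k"
    )
    model.eval()
    return model, preprocess


def embed_crop(model, preprocess, frame: Image.Image, box) -> torch.Tensor:
    """Crop (x1, y1, x2, y2) from the frame; return a unit-norm 512-d embedding."""
    crop = frame.crop(box)
    with torch.no_grad():
        feats = model.encode_image(preprocess(crop).unsqueeze(0))
    # L2-normalize so cosine similarity reduces to a dot product downstream.
    return (feats / feats.norm(dim=-1, keepdim=True)).squeeze(0)
```

Normalizing at embed time is what makes the pgvector cosine-distance query later equivalent to 1 - dot product.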
Zenoh Payload Format
The detector publishes a JSON envelope to the tb/detections Zenoh key for each keyframe:
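The exact envelope fields are not reproduced in this section; a plausible shape, with all keys illustrative and the embedding array truncated from its 512 values, might be:

```json
{
  "stamp": 1718031200.42,
  "frame_id": "kf_000123",
  "robot_pose": {"x": 1.82, "y": -0.34, "yaw": 0.51},
  "detections": [
    {
      "class": "chair",
      "confidence": 0.87,
      "bbox": [412, 215, 596, 480],
      "embedding": [0.0123, -0.0456]
    }
  ]
}
```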
Vector Database Storage
An ingest worker subscribes to tb/detections on the Zenoh router and writes each detection with its embedding and pose metadata to PostgreSQL with pgvector.
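A sketch of that ingest loop, assuming the zenoh-python (1.x API) and psycopg 3 client libraries; the connection string, table, and column names are illustrative:

```python
# Illustrative ingest worker: Zenoh subscriber -> pgvector insert.
import json

import psycopg
import zenoh

# Hypothetical DSN matching the `vector` compose service.
conn = psycopg.connect(
    "postgresql://postgres:postgres@vector:5432/postgres", autocommit=True
)


def on_detection(sample: zenoh.Sample) -> None:
    msg = json.loads(sample.payload.to_bytes())  # zenoh 1.x payload accessor
    pose = json.dumps(msg["robot_pose"])
    for det in msg["detections"]:
        # pgvector accepts a bracketed text literal cast to ::vector.
        vec = "[" + ",".join(str(v) for v in det["embedding"]) + "]"
        conn.execute(
            "INSERT INTO detections (stamp, class_name, confidence, pose, embedding)"
            " VALUES (%s, %s, %s, %s, %s::vector)",
            (msg["stamp"], det["class"], det["confidence"], pose, vec),
        )


session = zenoh.open(zenoh.Config())
# Keep a reference so the subscriber stays active for the process lifetime.
sub = session.declare_subscriber("tb/detections", on_detection)
```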
Schema
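A minimal pgvector schema for this pipeline might look like the following; the table and column names are assumptions, not the repo's actual DDL:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE detections (
    id          BIGSERIAL PRIMARY KEY,
    stamp       DOUBLE PRECISION NOT NULL,
    class_name  TEXT NOT NULL,
    confidence  REAL NOT NULL,
    pose        JSONB,                 -- robot pose at the keyframe
    embedding   VECTOR(512) NOT NULL   -- CLIP ViT-B/32 output
);

-- Optional approximate-nearest-neighbor index for cosine distance.
CREATE INDEX ON detections USING hnsw (embedding vector_cosine_ops);
```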
Querying similar objects
Once embeddings are stored, finding visually similar detections is a single SQL query. The <=> operator computes cosine distance; because the embeddings are L2-normalized, this is equivalent to 1 - dot_product.
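A similarity lookup of that shape might read as follows, where `:query_vec` stands in for a 512-dimensional pgvector literal and the table/column names are assumed:

```sql
-- Top-5 stored detections nearest to a query embedding.
SELECT id, class_name, stamp,
       embedding <=> :query_vec AS cosine_dist
FROM detections
ORDER BY embedding <=> :query_vec
LIMIT 5;
```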
Docker Services
The pipeline is defined in docker-compose.yaml with these services:
| Service | Role |
|---|---|
| demo-world-enhanced | Gazebo simulation with Nav2 |
| zenoh-router | Zenoh message bus with in-memory storage |
| zenoh-bridge | DDS to Zenoh bridge (forwards camera, odom, detections) |
| detector | YOLOv8 + CLIP keyframe detector (GPU) |
| embedding-ingest | Zenoh subscriber that writes to pgvector |
| vector | PostgreSQL with pgvector extension |
| detection-logger | Appends raw detections to JSONL files |
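An abbreviated sketch of how two of these services might be declared; image names, build paths, and the GPU reservation details are assumptions, not the repo's actual compose file:

```yaml
services:
  vector:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_PASSWORD: postgres
  detector:
    build: ./detector
    depends_on:
      - zenoh-router
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```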

