This page describes the methodology for building a keyframe-gated object detection pipeline that produces CLIP embeddings and stores them in a vector database. The pipeline runs alongside a ROS 2 simulated robot without depending on ROS 2 itself - it uses Zenoh as the transport layer and PostgreSQL with pgvector for persistent storage. The reference implementation is in the turtlebot-maze repository.

Architecture

The detection pipeline sits outside the ROS 2 container. It subscribes to camera images and odometry via Zenoh, runs inference on GPU, and publishes enriched detections back to Zenoh. A separate ingest worker writes embeddings to pgvector.

CDR Deserialization

The zenoh-bridge-ros2dds forwards raw CDR (Common Data Representation) bytes from DDS to Zenoh without re-encoding. CDR is the binary serialization format used by DDS, the middleware layer beneath ROS 2. The detector uses the pycdr2 Python library to deserialize these bytes directly into Python dataclasses - no ROS 2 installation required. Each ROS 2 message type maps to a pycdr2 dataclass:
from dataclasses import dataclass
from typing import List

from pycdr2 import IdlStruct
from pycdr2.types import uint8, uint32, int32, float64

@dataclass
class Time(IdlStruct, typename="builtin_interfaces/msg/Time"):
    sec: int32
    nanosec: uint32

@dataclass
class Header(IdlStruct, typename="std_msgs/msg/Header"):
    stamp: Time
    frame_id: str

@dataclass
class Image(IdlStruct, typename="sensor_msgs/msg/Image"):
    header: Header
    height: uint32
    width: uint32
    encoding: str
    is_bigendian: uint8
    step: uint32
    data: List[uint8]

# PoseWithCovariance and TwistWithCovariance are defined analogously
@dataclass
class Odometry(IdlStruct, typename="nav_msgs/msg/Odometry"):
    header: Header
    child_frame_id: str
    pose: PoseWithCovariance
    twist: TwistWithCovariance
Deserialization in each subscriber callback:
img_msg = Image.deserialize(bytes(sample.payload))      # camera callback
odom_msg = Odometry.deserialize(bytes(sample.payload))  # odometry callback
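The keyframe gate below works on a planar (x, y, yaw) pose, but Odometry carries orientation as a quaternion. A small helper can extract the yaw; the function name here is illustrative, not taken from the repository:

```python
import math

def yaw_from_quaternion(x: float, y: float, z: float, w: float) -> float:
    """Yaw (rotation about z) of a unit quaternion, in radians."""
    return math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))

# A pure 90-degree rotation about z
q = (0.0, 0.0, math.sin(math.pi / 4), math.cos(math.pi / 4))
yaw = yaw_from_quaternion(*q)
```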

Keyframe Gating

Not every camera frame warrants running inference. The camera streams at ~30 fps, but consecutive frames from a slow-moving robot are nearly identical. Keyframe gating selects frames worth processing based on how much the robot has moved since the last processed frame.

Per-frame decision flow

  1. A camera frame arrives (rate-limited to 10 Hz)
  2. The detector checks the latest cached pose from the odometry subscriber
  3. It computes the distance and angle change since the last accepted keyframe
  4. If the robot has moved less than 0.5 m AND rotated less than 15 degrees, the frame is discarded - no inference runs
  5. If either threshold is exceeded, this frame becomes a keyframe - YOLO and CLIP run
def is_keyframe() -> bool:
    global last_keyframe_pose  # (x, y, yaw) of the last accepted keyframe
    dx = latest_pose[0] - last_keyframe_pose[0]
    dy = latest_pose[1] - last_keyframe_pose[1]
    dist = math.hypot(dx, dy)
    # Wrap the yaw difference into [-pi, pi] before comparing
    dyaw = latest_pose[2] - last_keyframe_pose[2]
    dyaw = abs(math.atan2(math.sin(dyaw), math.cos(dyaw)))
    if dist >= 0.5 or dyaw >= math.radians(15):
        last_keyframe_pose = latest_pose
        return True
    return False
When the robot is stationary, zero frames pass the gate - no GPU compute is used. At typical navigation speeds (~0.2 m/s), a keyframe fires roughly every 2-3 seconds.
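The gate's behavior is easy to check with a self-contained version of the same logic, restructured here as a small class for illustration (the repository keeps this state in module globals):

```python
import math

class KeyframeGate:
    """Self-contained sketch of the keyframe gating logic above."""

    def __init__(self, dist_thresh=0.5, yaw_thresh=math.radians(15)):
        self.dist_thresh = dist_thresh
        self.yaw_thresh = yaw_thresh
        self.last = None  # (x, y, yaw) of the last accepted keyframe

    def accept(self, x, y, yaw):
        if self.last is None:  # always accept the first frame
            self.last = (x, y, yaw)
            return True
        dx, dy = x - self.last[0], y - self.last[1]
        dyaw = yaw - self.last[2]
        dyaw = abs(math.atan2(math.sin(dyaw), math.cos(dyaw)))  # wrap to [0, pi]
        if math.hypot(dx, dy) >= self.dist_thresh or dyaw >= self.yaw_thresh:
            self.last = (x, y, yaw)
            return True
        return False

# A robot advancing 0.2 m per frame fires on the first frame, then every third;
# a stationary robot fires nothing after the first frame.
gate = KeyframeGate()
accepted = [gate.accept(0.2 * i, 0.0, 0.0) for i in range(20)]
```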

Detection and Embedding Extraction

Each keyframe goes through two models in sequence:

YOLOv8 (detection)

YOLOv8 nano runs on the full frame and produces bounding boxes with class labels and confidence scores. The model is pretrained on COCO’s 80 object classes.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano variant, COCO-pretrained
results = model.predict(frame, conf=0.3, verbose=False)

CLIP (embedding)

For each detection, the bounding box region is cropped from the original frame and passed through a CLIP image encoder (ViT-B/32, pretrained on LAION-2B). This produces a 512-dimensional embedding vector that captures the visual semantics of the detected object.
import open_clip
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
clip_model = clip_model.to(device).eval()

# For each keyframe, embed all detection crops in one batch:
with torch.no_grad():
    batch = torch.stack([preprocess(crop) for crop in crops]).to(device)
    embeddings = clip_model.encode_image(batch)
    embeddings = F.normalize(embeddings, dim=-1)  # L2-normalize
All crops from a single keyframe are batched into one forward pass for efficiency. The embeddings are L2-normalized so that cosine similarity reduces to a dot product - this is what pgvector’s vector operators expect.

Why CLIP instead of YOLOv8 backbone features

YOLOv8’s backbone produces spatial feature maps optimized for detection, not for cross-instance visual similarity. CLIP embeddings are trained on image-text pairs across millions of concepts, making them effective for:
  • Matching the same physical object seen from different angles
  • Comparing objects across different runs (Run A mapping vs Run B re-localization)
  • KNN search in pgvector for semantic re-localization
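What the KNN search does server-side can be illustrated client-side with a toy ranking by dot product (real queries go through SQL; labels and vectors here are made up):

```python
def knn(query, rows, k=3):
    # rows: list of (label, unit_vector); higher dot product = more similar
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(rows, key=lambda r: -dot(query, r[1]))
    return [label for label, _ in ranked[:k]]

rows = [
    ("cup@run_a", [1.0, 0.0]),
    ("chair@run_a", [0.0, 1.0]),
    ("cup@run_b", [0.8, 0.6]),
]
# A query near the "cup" direction ranks both cup sightings first
nearest = knn([1.0, 0.0], rows, k=2)
```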

Zenoh Payload Format

The detector publishes a JSON envelope to the tb/detections Zenoh key for each keyframe:
{
  "keyframe_id": 42,
  "timestamp": 1711281234.567,
  "map_x": 2.14,
  "map_y": 0.83,
  "map_yaw": 0.785,
  "detections": [
    {
      "class": "cup",
      "confidence": 0.87,
      "bbox": [120.0, 80.0, 250.0, 310.0],
      "embedding": "<base64-encoded 512 x float32>",
      "embedding_dim": 512,
      "embedding_model": "ViT-B-32/laion2b_s34b_b79k"
    }
  ]
}
The embedding field is a base64-encoded byte string of 512 little-endian float32 values (2048 bytes raw). The ingest worker decodes this before writing to pgvector.
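The encoding round-trip can be sketched with the standard library (function names are illustrative):

```python
import base64
import struct

def encode_embedding(vec):
    # Pack as little-endian float32, matching the payload spec above
    raw = struct.pack(f"<{len(vec)}f", *vec)
    return base64.b64encode(raw).decode("ascii")

def decode_embedding(b64, dim=512):
    raw = base64.b64decode(b64)
    return list(struct.unpack(f"<{dim}f", raw))
```

A 512-dimensional embedding packs to 2048 raw bytes before base64 expansion, as stated above.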

Vector Database Storage

An ingest worker subscribes to tb/detections on the Zenoh router and writes each detection with its embedding and pose metadata to PostgreSQL with pgvector.

Schema

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE detection_embeddings (
  det_pk bigserial PRIMARY KEY,
  keyframe_id integer,
  class_name text NOT NULL,
  confidence real NOT NULL,
  bbox real[] NOT NULL,
  map_x real,
  map_y real,
  map_yaw real,
  embedding_model text,
  embedding vector(512),
  ingested_at timestamptz NOT NULL DEFAULT now()
);
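pgvector's vector type parses a bracketed text literal, so the ingest worker can pass the decoded floats as a string parameter cast with ::vector. A formatter along these lines is one option (hypothetical helper; the repository may instead use a driver-level adapter):

```python
def to_pgvector_literal(vec):
    # e.g. "[0.5,-1.0]" - the text form pgvector's vector type accepts
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

# Usage sketch with a parameterized query:
#   INSERT INTO detection_embeddings (..., embedding) VALUES (..., %s::vector)
# with to_pgvector_literal(embedding) bound to the %s placeholder.
```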

Querying similar objects

Once embeddings are stored, finding visually similar detections is a single SQL query:
SELECT class_name, confidence, map_x, map_y,
       embedding <=> '[0.012, -0.034, ...]'::vector AS distance
FROM detection_embeddings
ORDER BY distance
LIMIT 5;
The <=> operator computes cosine distance. Because the embeddings are L2-normalized, this is equivalent to 1 - dot_product.
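The equivalence is easy to verify in plain Python, with toy 2-D vectors standing in for the 512-dimensional embeddings:

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

# For unit vectors the norms are 1, so cosine distance is just 1 - dot product
a = [3 / 5, 4 / 5]  # already unit length
b = [1.0, 0.0]
d = cosine_distance(a, b)  # dot product is 0.6, so distance is 0.4
```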

Docker Services

The pipeline is defined in docker-compose.yaml with these services:
Service               Role
-------               ----
demo-world-enhanced   Gazebo simulation with Nav2
zenoh-router          Zenoh message bus with in-memory storage
zenoh-bridge          DDS to Zenoh bridge (forwards camera, odom, detections)
detector              YOLOv8 + CLIP keyframe detector (GPU)
embedding-ingest      Zenoh subscriber that writes to pgvector
vector                PostgreSQL with pgvector extension
detection-logger      Appends raw detections to JSONL files
To bring up the full pipeline:
docker compose up -d \
  demo-world-enhanced \
  zenoh-router zenoh-bridge \
  detector embedding-ingest \
  vector detection-logger