Overview and learning objectives
In this assignment, you will build a visual retrieval system that answers the following query: Given a single image of a car exterior component, retrieve the video clip(s) in which that component appears. The system must operate by detecting semantic content in a video stream and matching it against image-based queries. The emphasis is on representation, indexing, and retrieval rather than end-to-end supervised training. By completing this assignment, you will learn to:
- Use an object detector to extract semantic structure from video.
- Index detections in a form suitable for retrieval.
- Perform image-to-video semantic search using shared representations.
- Produce machine-readable outputs for downstream evaluation.
Provided data
1. Training dataset (for detector selection only)
Students may use the following dataset to select and configure an object detector:
- Ultralytics Car Parts Segmentation Dataset: https://docs.ultralytics.com/datasets/segment/carparts-seg/
2. Input video (retrieval corpus)
- A single car exterior video will be provided:
3. Query images (semantic search)
Query images will be drawn from a slightly different distribution than the video frames. These query images are available as a public Hugging Face dataset:
- Dataset: aegean-ai/rav4-exterior-images
| Column | Type | Description |
|---|---|---|
| image | Image | The extracted frame (JPEG) |
| timestamp | string | Time position in the source video (MM:SS) |
| timestamp_sec | int | Time position in seconds |
| exterior_score | float | CLIP zero-shot classification confidence that the frame shows a car exterior |
| width | int | Frame width in pixels |
| height | int | Frame height in pixels |
| video_title | string | Title of the source YouTube video |
Frames were filtered with CLIP zero-shot classification (openai/clip-vit-base-patch32) using an exterior confidence threshold of 0.90. Only frames classified as exterior views with high confidence are included.
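For reference, this filtering step can be approximated with a few lines of `transformers` code; the text prompts and file name below are assumptions, since the exact prompts used to build the dataset are not specified:

```python
# Sketch: CLIP zero-shot exterior classification (prompts and file name are illustrative assumptions).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["a photo of a car exterior", "a photo of a car interior"]
image = Image.open("frame.jpg")

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
exterior_score = probs[0].item()      # confidence for the "car exterior" prompt
keep_frame = exterior_score >= 0.90   # threshold used for the provided dataset
```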
You can load the dataset in Python with the Hugging Face `datasets` library, for example (the `train` split name is assumed; check the dataset card):
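```python
# Sketch: load the query-image dataset (split name assumed; check the dataset card).
from datasets import load_dataset

queries = load_dataset("aegean-ai/rav4-exterior-images", split="train")
print(queries[0]["timestamp"], queries[0]["timestamp_sec"], queries[0]["exterior_score"])
```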
Task definition
You will build a system that performs image-based semantic search over a video. Given a query image of a car exterior component, your system must:
- Identify which semantic component(s) appear in the query image.
- Retrieve the corresponding video clip(s) in which that component is visible.
- Return the matching clip and its temporal extent, that is, the start and end timestamps (in seconds) of every contiguous segment where the queried component is detected.

You can verify your results visually using the YouTube embed URL with `start` and `end` parameters. For example, to check a clip from 2:00 to 2:45, open the embed URL with `start=120` and `end=165`. This plays only the specified interval, letting you prove that the returned segment actually contains the queried component.
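If you want to script this check, the embed URL can be built as below; the `VIDEO_ID` value is a placeholder for the ID of the provided video:

```python
# Sketch: build a YouTube embed URL that plays only a [start, end] interval (seconds). VIDEO_ID is a placeholder.
def embed_url(video_id: str, start_sec: int, end_sec: int) -> str:
    return f"https://www.youtube.com/embed/{video_id}?start={start_sec}&end={end_sec}"

print(embed_url("VIDEO_ID", 120, 165))  # 2:00 to 2:45
```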
System requirements
1. Video processing and detection
You must (a minimal sketch of this stage follows the list):
- Sample frames from the input video.
- Run an object detector on each frame.
- Produce bounding boxes and class labels for detected exterior components.
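A minimal sketch of this sampling-and-detection loop, assuming OpenCV for frame extraction and an Ultralytics YOLO model selected using the car-parts dataset; the weights file, video file name, and one-frame-per-second sampling rate are placeholders to adapt:

```python
# Sketch: sample frames at ~1 fps and run a detector on each (weights, file names, and rate are assumptions).
import cv2
from ultralytics import YOLO

model = YOLO("carparts_detector.pt")        # hypothetical weights selected/trained on the car-parts dataset
cap = cv2.VideoCapture("car_exterior.mp4")  # placeholder file name for the provided video
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
step = int(round(fps))                      # roughly one sampled frame per second

detections = []
frame_index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if frame_index % step == 0:
        for result in model(frame, verbose=False):
            for box in result.boxes:
                x_min, y_min, x_max, y_max = box.xyxy[0].tolist()
                detections.append({
                    "video_id": "car_exterior_video",
                    "frame_index": frame_index,
                    "timestamp": frame_index / fps,
                    "class_label": model.names[int(box.cls)],
                    "bounding_box": [x_min, y_min, x_max, y_max],
                    "confidence_score": float(box.conf),
                })
    frame_index += 1
cap.release()
```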
2. Image semantic search
For each query image (a sketch of the interval-merging step follows the list):
- Run the same detector (or a compatible image encoder).
- Identify the detected component class(es).
- Match these against detected components in the video index.
- Retrieve contiguous time intervals where the component is present.
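For the last step, one simple approach is to collect the timestamps of frames where the queried class was detected and merge nearby timestamps into intervals; the gap tolerance below is an assumption you should tune:

```python
# Sketch: merge timestamps where the queried class was detected into contiguous [start, end] intervals.
def merge_into_intervals(timestamps_sec, max_gap=2.0):
    """Group detection times into intervals, splitting when the gap exceeds max_gap seconds."""
    if not timestamps_sec:
        return []
    times = sorted(timestamps_sec)
    intervals = [[times[0], times[0]]]
    for t in times[1:]:
        if t - intervals[-1][1] <= max_gap:
            intervals[-1][1] = t          # extend the current interval
        else:
            intervals.append([t, t])      # start a new interval
    return [(start, end) for start, end in intervals]

# Example: detections at 10-12 s and 30-31 s yield two clips.
print(merge_into_intervals([10.0, 11.0, 12.0, 30.0, 31.0]))  # [(10.0, 12.0), (30.0, 31.0)]
```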
3. Output format (required)
All detection results must be uploaded to Hugging Face as a Parquet file. Each row in the Parquet file must correspond to a single detection in the video and contain at least the following fields (an export sketch follows the list):
- video_id
- frame_index or timestamp
- class_label
- bounding_box (x_min, y_min, x_max, y_max)
- confidence_score
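A sketch of the export step, assuming pandas for writing the Parquet file and `huggingface_hub` for the upload; the repository id and the example row values are placeholders:

```python
# Sketch: write detection rows to Parquet and upload to a Hugging Face dataset repo (repo id is a placeholder).
import pandas as pd
from huggingface_hub import HfApi

rows = [{
    "video_id": "car_exterior_video",
    "frame_index": 300,
    "class_label": "front_bumper",          # illustrative class name
    "bounding_box": [102.0, 55.0, 640.0, 410.0],
    "confidence_score": 0.87,
}]
pd.DataFrame(rows).to_parquet("detections.parquet", index=False)

HfApi().upload_file(
    path_or_fileobj="detections.parquet",
    path_in_repo="detections.parquet",
    repo_id="your-username/car-parts-detections",  # placeholder repo id
    repo_type="dataset",
)
```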
Retrieval output
For each query image, your system must return (an illustrative record follows the list):
- start_timestamp
- end_timestamp
- class_label used for retrieval
- number_of_supporting_detections
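For concreteness, a single retrieval result could be represented as follows (all values are illustrative):

```python
# Sketch: one retrieval result per contiguous clip (values are illustrative only).
result = {
    "start_timestamp": 120.0,               # seconds
    "end_timestamp": 165.0,                 # seconds
    "class_label": "front_bumper",          # hypothetical detector class
    "number_of_supporting_detections": 42,  # frame-level detections inside the interval
}
```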
Evaluation criteria
You will be graded on:
- Correctness: Do the retrieved clips actually contain the queried component?
- Temporal coherence: Are clips reasonably contiguous, or overly fragmented?
- Detection quality: Are detections consistent and semantically meaningful?
- Data engineering quality: Is the Parquet schema clean, well-documented, and reproducible?
- Report clarity: Can you clearly explain how image queries are matched to video content?
Restrictions
- You may not manually label frames from the video.
- You may not hard-code timestamps for specific components.
- You may not use query-specific heuristics.
Deliverables
- A Hugging Face repository containing:
  - The Parquet file with video detections.
  - A short README describing the schema.
- A short report (3–4 pages) explaining:
  - Detector choice and configuration.
  - Video sampling strategy.
  - Image-to-video matching logic.
  - Failure cases and limitations.

