
COCO Data Pipeline for Anchor-Free Detection

Notebook 1 of 5 in the YOLOv11 from-scratch series
Modern YOLO detectors require a specialized data pipeline that goes well beyond simple image loading. The pipeline must handle several responsibilities:
  • Parsing COCO-format annotations and mapping non-contiguous category IDs to a contiguous range
  • Resizing images via letterboxing to preserve aspect ratio while fitting a fixed input resolution
  • Augmenting training data with techniques like mosaic augmentation to increase object diversity per sample
  • Encoding ground-truth bounding boxes into multi-scale target tensors suitable for anchor-free detection heads
In this notebook we build a complete COCO data pipeline for YOLOv11 training. The pipeline produces:
| Output | Grid Size | Stride | Object Scale |
|--------|-----------|--------|--------------|
| P3     | 80 x 80   | 8      | Small        |
| P4     | 40 x 40   | 16     | Medium       |
| P5     | 20 x 20   | 32     | Large        |
All outputs assume a 640 x 640 input resolution. By the end of this notebook you will have a DataLoader that yields image tensors paired with multi-scale target grids ready for training.
import os, json, random
from pathlib import Path
from typing import Dict, List, Tuple, Optional

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import matplotlib.patches as patches

import torch
from torch.utils.data import Dataset, IterableDataset, DataLoader

from datasets import load_dataset

# Configuration
IMG_SIZE = 640
NUM_CLASSES = 80
STRIDES = [8, 16, 32]  # P3, P4, P5
GRID_SIZES = [IMG_SIZE // s for s in STRIDES]  # 80, 40, 20

COCO annotation format

The COCO (Common Objects in Context) dataset uses a JSON annotation format with three top-level keys:
  • images — a list of image metadata entries, each containing an id, file_name, width, and height.
  • annotations — a list of object annotations. Each annotation links to an image via image_id and contains a bbox in top-left [x, y, width, height] format, a category_id, and an iscrowd flag.
  • categories — a list of category definitions mapping id to name.
One important detail: COCO category IDs are not contiguous. The 2017 annotations skip several IDs (for example, there is no category 12), so the 80 classes span IDs 1 through 90. We need to build a mapping from the original IDs to a contiguous 0..N-1 range for use in classification targets.
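As a quick sketch of the remapping (the IDs below are made up for illustration), sorting the raw IDs and enumerating them produces the contiguous mapping the parser builds:

```python
# Toy illustration: map non-contiguous category IDs to a contiguous 0..N-1 range.
# These IDs are invented for the example, not taken from the real COCO file.
cat_ids = sorted([1, 2, 16, 17, 90])
cat_id_to_continuous = {cid: i for i, cid in enumerate(cat_ids)}
print(cat_id_to_continuous)  # {1: 0, 2: 1, 16: 2, 17: 3, 90: 4}
```

This is exactly the dictionary comprehension used in `COCOParser.__init__` below.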
class COCOParser:
    """Parse COCO-format annotations."""

    def __init__(self, annotation_file: str, image_dir: str):
        with open(annotation_file, 'r') as f:
            coco = json.load(f)

        self.image_dir = image_dir
        self.images = {img['id']: img for img in coco['images']}

        # Build category mapping (COCO IDs are not contiguous)
        cat_ids = sorted([c['id'] for c in coco['categories']])
        self.cat_id_to_continuous = {cid: i for i, cid in enumerate(cat_ids)}
        self.categories = {c['id']: c['name'] for c in coco['categories']}

        # Group annotations by image
        self.img_annotations = {}
        for ann in coco['annotations']:
            if ann.get('iscrowd', 0):
                continue
            img_id = ann['image_id']
            if img_id not in self.img_annotations:
                self.img_annotations[img_id] = []
            self.img_annotations[img_id].append(ann)

        # Only keep images that have annotations
        self.img_ids = [iid for iid in self.images if iid in self.img_annotations]
        print(f"Loaded {len(self.img_ids)} images with "
              f"{sum(len(v) for v in self.img_annotations.values())} annotations")

    def get_image_path(self, img_id: int) -> str:
        return os.path.join(self.image_dir, self.images[img_id]['file_name'])

    def get_annotations(self, img_id: int) -> List[Dict]:
        return self.img_annotations.get(img_id, [])

Letterbox resizing

YOLO models expect a fixed square input (640 x 640). Naively resizing images to this shape would distort their aspect ratio, which can hurt detection accuracy — especially for objects with extreme aspect ratios. Letterboxing solves this by:
  1. Scaling the image so its longest side matches the target size.
  2. Padding the shorter side symmetrically with a neutral gray value (114) to form a square.
This preserves the original aspect ratio while fitting the model’s input dimensions. The bounding box coordinates must be adjusted to account for both the scale factor and the padding offset.
def letterbox_resize(image: np.ndarray, target_size: int = 640
                     ) -> Tuple[np.ndarray, float, Tuple[int, int]]:
    """Resize image with letterboxing (preserve aspect ratio, pad to square).

    Returns:
        resized_image: (target_size, target_size, 3) uint8 array
        scale: resize scale factor
        pad: (pad_w, pad_h) padding applied
    """
    h, w = image.shape[:2]
    scale = target_size / max(h, w)
    new_w, new_h = int(w * scale), int(h * scale)

    resized = np.array(Image.fromarray(image).resize((new_w, new_h), Image.BILINEAR))

    # Create padded image (gray padding = 114)
    padded = np.full((target_size, target_size, 3), 114, dtype=np.uint8)
    pad_w = (target_size - new_w) // 2
    pad_h = (target_size - new_h) // 2
    padded[pad_h:pad_h + new_h, pad_w:pad_w + new_w] = resized

    return padded, scale, (pad_w, pad_h)


def adjust_boxes_for_letterbox(boxes: np.ndarray, scale: float,
                                pad: Tuple[int, int]) -> np.ndarray:
    """Adjust bounding boxes after letterbox resize.

    Args:
        boxes: (N, 4) in [x_center, y_center, w, h] format (original pixel coords)
        scale: letterbox scale factor
        pad: (pad_w, pad_h)
    Returns:
        adjusted: (N, 4) in [x_center, y_center, w, h] in letterboxed image coords
    """
    adjusted = boxes.copy().astype(np.float32)
    adjusted[:, 0] = boxes[:, 0] * scale + pad[0]  # x_center
    adjusted[:, 1] = boxes[:, 1] * scale + pad[1]  # y_center
    adjusted[:, 2] = boxes[:, 2] * scale            # width
    adjusted[:, 3] = boxes[:, 3] * scale            # height
    return adjusted
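A quick arithmetic sanity check of the letterbox math, using a hypothetical 480 x 640 (h x w) image: the scale is 1.0 because the width already matches, only the height gets padded, and a box center shifts down by the vertical padding:

```python
# Letterbox arithmetic for a hypothetical 480 x 640 (h, w) image, target 640
h, w, target = 480, 640, 640
scale = target / max(h, w)                 # 1.0 (width already matches target)
new_w, new_h = int(w * scale), int(h * scale)
pad_w = (target - new_w) // 2              # 0
pad_h = (target - new_h) // 2              # 80 gray rows on top and bottom
# A box centered at (320, 240) in the original lands at (320, 320) after padding
cx, cy = 320 * scale + pad_w, 240 * scale + pad_h
print(scale, (pad_w, pad_h), (cx, cy))    # 1.0 (0, 80) (320.0, 320.0)
```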

Mosaic augmentation

Mosaic augmentation was introduced in YOLOv4 and remains a staple in modern YOLO training. The idea is simple but powerful: combine four randomly selected training images into a single composite image by placing each in one quadrant. Benefits:
  • More objects per sample — the model sees objects from four images in a single forward pass, which improves gradient quality.
  • Context diversity — objects appear against varied backgrounds and alongside different neighbors.
  • Reduced batch size dependence — because each sample is richer, you can train effectively with smaller batches.
  • Scale variation — objects end up at a wider range of scales than they would in isolated images.
The mosaic center is randomized to prevent the model from learning a fixed spatial prior.
def mosaic_augmentation(dataset, indices: List[int],
                        img_size: int = 640
                        ) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Create a mosaic from 4 images.

    Returns:
        mosaic_img: (img_size, img_size, 3)
        mosaic_boxes: (N, 4) [x_center, y_center, w, h] normalized to [0,1]
        mosaic_labels: (N,) class indices
    """
    cx, cy = img_size // 2, img_size // 2  # mosaic center
    # Add random offset for variety
    cx += random.randint(-img_size // 4, img_size // 4)
    cy += random.randint(-img_size // 4, img_size // 4)

    mosaic_img = np.full((img_size, img_size, 3), 114, dtype=np.uint8)
    all_boxes = []
    all_labels = []

    for i, idx in enumerate(indices):
        img, boxes, labels = dataset.load_raw(idx)
        h, w = img.shape[:2]

        # Determine placement in mosaic quadrant
        if i == 0:    # top-left
            x1, y1, x2, y2 = max(cx - w, 0), max(cy - h, 0), cx, cy
            crop_x1, crop_y1 = w - (x2 - x1), h - (y2 - y1)
            crop_x2, crop_y2 = w, h
        elif i == 1:  # top-right
            x1, y1, x2, y2 = cx, max(cy - h, 0), min(cx + w, img_size), cy
            crop_x1, crop_y1 = 0, h - (y2 - y1)
            crop_x2, crop_y2 = x2 - x1, h
        elif i == 2:  # bottom-left
            x1, y1, x2, y2 = max(cx - w, 0), cy, cx, min(cy + h, img_size)
            crop_x1, crop_y1 = w - (x2 - x1), 0
            crop_x2, crop_y2 = w, y2 - y1
        else:         # bottom-right
            x1, y1, x2, y2 = cx, cy, min(cx + w, img_size), min(cy + h, img_size)
            crop_x1, crop_y1 = 0, 0
            crop_x2, crop_y2 = x2 - x1, y2 - y1

        mosaic_img[y1:y2, x1:x2] = img[crop_y1:crop_y2, crop_x1:crop_x2]

        # Adjust boxes: convert from normalized [0,1] to pixel coords in mosaic
        if len(boxes) > 0:
            pixel_boxes = boxes.copy()
            pixel_boxes[:, 0] = boxes[:, 0] * w - crop_x1 + x1  # x_center
            pixel_boxes[:, 1] = boxes[:, 1] * h - crop_y1 + y1  # y_center
            pixel_boxes[:, 2] = boxes[:, 2] * w                   # width
            pixel_boxes[:, 3] = boxes[:, 3] * h                   # height
            all_boxes.append(pixel_boxes)
            all_labels.append(labels)

    if all_boxes:
        all_boxes = np.concatenate(all_boxes, axis=0)
        all_labels = np.concatenate(all_labels, axis=0)

        # Clip to mosaic bounds and filter invalid
        x1 = all_boxes[:, 0] - all_boxes[:, 2] / 2
        y1 = all_boxes[:, 1] - all_boxes[:, 3] / 2
        x2 = all_boxes[:, 0] + all_boxes[:, 2] / 2
        y2 = all_boxes[:, 1] + all_boxes[:, 3] / 2
        x1 = np.clip(x1, 0, img_size)
        y1 = np.clip(y1, 0, img_size)
        x2 = np.clip(x2, 0, img_size)
        y2 = np.clip(y2, 0, img_size)

        all_boxes[:, 2] = x2 - x1
        all_boxes[:, 3] = y2 - y1
        all_boxes[:, 0] = (x1 + x2) / 2
        all_boxes[:, 1] = (y1 + y2) / 2

        # Filter out tiny boxes
        valid = (all_boxes[:, 2] > 2) & (all_boxes[:, 3] > 2)
        all_boxes = all_boxes[valid]
        all_labels = all_labels[valid]

        # Normalize to [0, 1]
        all_boxes[:, [0, 2]] /= img_size
        all_boxes[:, [1, 3]] /= img_size
    else:
        all_boxes = np.zeros((0, 4), dtype=np.float32)
        all_labels = np.zeros((0,), dtype=np.int64)

    return mosaic_img, all_boxes, all_labels
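To see that the crop and destination regions line up, here is the top-left quadrant arithmetic from the function above traced with concrete (made-up) numbers: a 400 x 400 source placed against a centered mosaic center contributes its bottom-right 320 x 320 corner:

```python
# Top-left quadrant (i == 0) placement arithmetic, mirroring mosaic_augmentation
img_size = 640
cx = cy = img_size // 2            # mosaic center, without the random offset
w = h = 400                        # hypothetical source image size
x1, y1, x2, y2 = max(cx - w, 0), max(cy - h, 0), cx, cy
crop_x1, crop_y1 = w - (x2 - x1), h - (y2 - y1)   # keep the bottom-right of the source
crop_x2, crop_y2 = w, h
# The destination and crop regions must have identical sizes for the paste to be valid
assert (x2 - x1, y2 - y1) == (crop_x2 - crop_x1, crop_y2 - crop_y1) == (320, 320)
print((x1, y1, x2, y2), (crop_x1, crop_y1, crop_x2, crop_y2))
```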

Multi-scale target encoding

YOLOv11 uses an anchor-free detection paradigm. Instead of pre-defined anchor boxes, each grid cell directly predicts whether it contains an object center and, if so, the bounding box parameters. The target encoding works as follows:
  1. Scale assignment — each ground-truth box is assigned to the feature pyramid level (P3, P4, or P5) whose receptive field best matches the box size. Small objects (up to 64 px) go to P3, medium objects (65-128 px) to P4, and large objects (129+ px) to P5.
  2. Grid cell assignment — within the chosen scale, the grid cell that contains the box center is designated as the positive sample.
  3. Target encoding — at the assigned grid cell, we store:
    • Objectness = 1.0 (binary indicator that this cell is responsible for an object)
    • Center offsets (cx_offset, cy_offset) — the fractional position of the box center within the grid cell, both in [0, 1]
    • Box dimensions (w, h) — normalized by the image size
    • Class label — one-hot encoded across the number of classes
The resulting target tensor at each scale has shape (grid_h, grid_w, 5 + num_classes) where the first 5 channels are [objectness, cx_offset, cy_offset, w, h].
def encode_targets(boxes: np.ndarray, labels: np.ndarray,
                   img_size: int = 640, num_classes: int = 80,
                   strides: List[int] = [8, 16, 32]) -> List[np.ndarray]:
    """Encode ground-truth boxes into multi-scale target tensors for anchor-free detection.

    Args:
        boxes: (N, 4) normalized [cx, cy, w, h] in [0, 1]
        labels: (N,) class indices
        strides: feature map strides

    Returns:
        targets: list of arrays, one per scale level
            Each has shape (grid_h, grid_w, 5 + num_classes)
            Channel layout: [obj, cx_offset, cy_offset, w, h, one_hot_classes...]
    """
    targets = []
    for stride in strides:
        grid_size = img_size // stride
        # obj(1) + box(4) + classes
        target = np.zeros((grid_size, grid_size, 5 + num_classes), dtype=np.float32)
        targets.append(target)

    for i in range(len(boxes)):
        cx, cy, w, h = boxes[i]
        cls = int(labels[i])

        # Convert to pixel coords
        cx_px = cx * img_size
        cy_px = cy * img_size
        w_px = w * img_size
        h_px = h * img_size

        # Assign to stride level based on box size
        box_size = max(w_px, h_px)
        if box_size <= 64:
            level = 0   # P3, stride 8
        elif box_size <= 128:
            level = 1   # P4, stride 16
        else:
            level = 2   # P5, stride 32

        stride = strides[level]
        grid_size = img_size // stride

        # Grid cell containing the box center
        gx = int(cx_px / stride)
        gy = int(cy_px / stride)
        gx = min(gx, grid_size - 1)
        gy = min(gy, grid_size - 1)

        # Encode: objectness, center offset within cell, box size
        targets[level][gy, gx, 0] = 1.0                # objectness
        targets[level][gy, gx, 1] = cx_px / stride - gx  # cx offset [0,1]
        targets[level][gy, gx, 2] = cy_px / stride - gy  # cy offset [0,1]
        targets[level][gy, gx, 3] = w_px / img_size     # normalized width
        targets[level][gy, gx, 4] = h_px / img_size     # normalized height
        targets[level][gy, gx, 5 + cls] = 1.0           # one-hot class

    return targets
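Walking one box through the assignment by hand (values chosen for illustration): a 32 px box centered at (0.5, 0.25) counts as small, so it goes to P3 (stride 8) and lands in grid cell (gx, gy) = (40, 20) with zero center offsets:

```python
# Hand trace of encode_targets' scale and cell assignment for a single box
img_size, strides = 640, [8, 16, 32]
cx, cy, w, h = 0.5, 0.25, 0.05, 0.05          # normalized [cx, cy, w, h]
cx_px, cy_px = cx * img_size, cy * img_size   # (320.0, 160.0)
box_size = max(w, h) * img_size               # 32.0 px -> "small" bucket
level = 0 if box_size <= 64 else (1 if box_size <= 128 else 2)
stride = strides[level]                       # 8 (P3)
gx, gy = int(cx_px / stride), int(cy_px / stride)     # (40, 20)
offs = (cx_px / stride - gx, cy_px / stride - gy)     # (0.0, 0.0)
print(level, (gx, gy), offs)                  # 0 (40, 20) (0.0, 0.0)
```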
class YOLODataset(Dataset):
    """COCO-format dataset for YOLOv11 training with mosaic augmentation."""

    def __init__(self, annotation_file: str, image_dir: str, img_size: int = 640,
                 num_classes: int = 80, augment: bool = True, mosaic_prob: float = 0.5):
        self.parser = COCOParser(annotation_file, image_dir)
        self.img_size = img_size
        self.num_classes = num_classes
        self.augment = augment
        self.mosaic_prob = mosaic_prob

    def __len__(self):
        return len(self.parser.img_ids)

    def load_raw(self, idx: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
        """Load image and annotations without augmentation."""
        img_id = self.parser.img_ids[idx]
        img = np.array(Image.open(self.parser.get_image_path(img_id)).convert('RGB'))
        anns = self.parser.get_annotations(img_id)

        boxes = []
        labels = []
        h, w = img.shape[:2]
        for ann in anns:
            x, y, bw, bh = ann['bbox']  # COCO format: top-left x, y, w, h
            # Convert to center format and normalize
            cx = (x + bw / 2) / w
            cy = (y + bh / 2) / h
            bw = bw / w
            bh = bh / h
            if bw > 0 and bh > 0:
                boxes.append([cx, cy, bw, bh])
                labels.append(self.parser.cat_id_to_continuous[ann['category_id']])

        boxes = np.array(boxes, dtype=np.float32) if boxes else np.zeros((0, 4), dtype=np.float32)
        labels = np.array(labels, dtype=np.int64) if labels else np.zeros((0,), dtype=np.int64)
        return img, boxes, labels

    def __getitem__(self, idx):
        if self.augment and random.random() < self.mosaic_prob:
            indices = [idx] + [random.randint(0, len(self) - 1) for _ in range(3)]
            img, boxes, labels = mosaic_augmentation(self, indices, self.img_size)
        else:
            img, boxes, labels = self.load_raw(idx)
            oh, ow = img.shape[:2]  # original dims, needed to undo normalization
            img, scale, pad = letterbox_resize(img, self.img_size)
            if len(boxes) > 0:
                # Convert normalized boxes to pixel, adjust for letterbox, re-normalize
                pixel_boxes = boxes.copy()
                pixel_boxes[:, 0] *= ow
                pixel_boxes[:, 1] *= oh
                pixel_boxes[:, 2] *= ow
                pixel_boxes[:, 3] *= oh
                boxes = adjust_boxes_for_letterbox(pixel_boxes, scale, pad)
                boxes[:, [0, 2]] /= self.img_size
                boxes[:, [1, 3]] /= self.img_size

        # Encode multi-scale targets
        targets = encode_targets(boxes, labels, self.img_size, self.num_classes)

        # To tensor
        img_tensor = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
        target_tensors = [torch.from_numpy(t) for t in targets]

        return img_tensor, target_tensors, torch.from_numpy(boxes), torch.from_numpy(labels)


def yolo_collate_fn(batch):
    """Custom collate: stack images, list targets (variable bbox count)."""
    imgs = torch.stack([b[0] for b in batch])
    targets_p3 = torch.stack([b[1][0] for b in batch])
    targets_p4 = torch.stack([b[1][1] for b in batch])
    targets_p5 = torch.stack([b[1][2] for b in batch])
    boxes = [b[2] for b in batch]      # list of variable-length tensors
    labels = [b[3] for b in batch]     # list of variable-length tensors
    return imgs, [targets_p3, targets_p4, targets_p5], boxes, labels

Visualization utilities

The following helper functions let us inspect the pipeline output visually. The first function draws bounding boxes on an image tensor, and the second displays the objectness maps at each feature pyramid level.
def visualize_sample(img_tensor, boxes, labels, category_names=None, ax=None):
    """Visualize an image with bounding boxes."""
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(10, 10))

    img = img_tensor.permute(1, 2, 0).numpy()
    ax.imshow(img)

    colors = plt.cm.Set3(np.linspace(0, 1, 80))

    for i in range(len(boxes)):
        cx, cy, w, h = boxes[i].numpy()
        # Convert from normalized to pixel
        cx *= IMG_SIZE; cy *= IMG_SIZE; w *= IMG_SIZE; h *= IMG_SIZE
        x1 = cx - w / 2
        y1 = cy - h / 2

        cls = int(labels[i])
        color = colors[cls % len(colors)]
        rect = patches.Rectangle((x1, y1), w, h, linewidth=2,
                                  edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        name = category_names.get(cls, str(cls)) if category_names else str(cls)
        ax.text(x1, y1 - 5, name, color='white', fontsize=8,
                bbox=dict(boxstyle='round,pad=0.2', facecolor=color, alpha=0.7))

    ax.axis('off')
    return ax
def visualize_targets(targets, strides=[8, 16, 32]):
    """Visualize objectness maps at each scale."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    for i, (target, stride) in enumerate(zip(targets, strides)):
        obj_map = target[:, :, 0] if isinstance(target, np.ndarray) else target[..., 0].numpy()
        axes[i].imshow(obj_map, cmap='hot', interpolation='nearest')
        axes[i].set_title(f'P{i+3} (stride={stride}, grid={obj_map.shape[0]}x{obj_map.shape[1]})')
        axes[i].set_xlabel(f'{int(obj_map.sum())} objects assigned')

    plt.suptitle('Multi-Scale Target Assignment (Objectness Maps)', fontsize=14)
    plt.tight_layout()
    plt.show()

Loading real COCO data via Hugging Face streaming

Rather than relying on synthetic placeholder images or a local COCO download, we stream real COCO images directly from detection-datasets/coco on the Hugging Face Hub. Images are fetched on the fly, so no files need to be stored locally.
Data source: Images streamed from detection-datasets/coco. See our HF COCO streaming tutorial for details.
The streaming dataset wraps the HF iterable as a PyTorch IterableDataset, converting annotations from COCO format ([x, y, w, h] with top-left origin) to YOLO format ([cx, cy, w, h] normalized, 0-indexed labels). It applies the same letterbox resize and multi-scale target encoding as the disk-based YOLODataset. Note: Mosaic augmentation requires random access to the dataset, which is incompatible with IterableDataset. The streaming demo skips mosaic; mosaic augmentation is already demonstrated above with the disk-based YOLODataset.
# COCO class names (80 categories, 0-indexed as provided by the HF dataset)
COCO_NAMES = [
    'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck',
    'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench',
    'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra',
    'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
    'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
    'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
    'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
    'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
    'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
    'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
    'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
    'hair drier', 'toothbrush'
]


def transform_to_yolo(example):
    """Convert a single HF COCO example to YOLO format.

    The HF dataset provides bounding boxes in COCO format [x, y, w, h] (pixels,
    top-left corner) with 0-indexed category labels. We convert to YOLO format
    [cx, cy, w, h] (normalized) with the same 0-indexed labels.
    """
    img = np.array(example['image'].convert('RGB'))
    h, w = img.shape[:2]

    bboxes = example['objects']['bbox']
    cats = example['objects']['category']

    boxes = []
    labels = []
    for bbox, cat_id in zip(bboxes, cats):
        bx, by, bw, bh = bbox
        if bw <= 0 or bh <= 0:
            continue
        cx = (bx + bw / 2) / w
        cy = (by + bh / 2) / h
        boxes.append([cx, cy, bw / w, bh / h])
        labels.append(int(cat_id))

    return {
        'image': img,
        'boxes': np.array(boxes, dtype=np.float32) if boxes else np.zeros((0, 4), dtype=np.float32),
        'labels': np.array(labels, dtype=np.int64) if labels else np.zeros((0,), dtype=np.int64),
    }


class COCOStreamYOLODataset(IterableDataset):
    """Stream COCO from Hugging Face and yield YOLO-format training samples.

    Each sample goes through letterbox resize and multi-scale target encoding,
    identical to the disk-based YOLODataset above. Mosaic augmentation is skipped
    because it requires random access, which is incompatible with streaming.
    """

    def __init__(self, split='train', max_samples=None, img_size=640, num_classes=80):
        self.split = split
        self.max_samples = max_samples
        self.img_size = img_size
        self.num_classes = num_classes

    def __iter__(self):
        ds = load_dataset('detection-datasets/coco', split=self.split, streaming=True)

        count = 0
        for example in ds:
            if self.max_samples and count >= self.max_samples:
                break

            parsed = transform_to_yolo(example)
            img = parsed['image']
            boxes = parsed['boxes']
            labels = parsed['labels']

            if len(boxes) == 0:
                continue

            # Letterbox resize (same as disk-based pipeline)
            orig_h, orig_w = img.shape[:2]
            img, scale, pad = letterbox_resize(img, self.img_size)

            # Adjust boxes for letterbox
            pixel_boxes = boxes.copy()
            pixel_boxes[:, 0] *= orig_w
            pixel_boxes[:, 1] *= orig_h
            pixel_boxes[:, 2] *= orig_w
            pixel_boxes[:, 3] *= orig_h
            boxes = adjust_boxes_for_letterbox(pixel_boxes, scale, pad)
            boxes[:, [0, 2]] /= self.img_size
            boxes[:, [1, 3]] /= self.img_size

            # Encode multi-scale targets
            targets = encode_targets(boxes, labels, self.img_size, self.num_classes)

            img_tensor = torch.from_numpy(img).permute(2, 0, 1).float() / 255.0
            target_tensors = [torch.from_numpy(t) for t in targets]

            yield img_tensor, target_tensors, torch.from_numpy(boxes), torch.from_numpy(labels)
            count += 1


# Stream 16 real COCO images for demonstration
stream_dataset = COCOStreamYOLODataset(split='train', max_samples=16)
stream_loader = DataLoader(stream_dataset, batch_size=4, collate_fn=yolo_collate_fn, num_workers=0)

print("Streaming real COCO images from Hugging Face...")
batch = next(iter(stream_loader))
imgs, targets, boxes_list, labels_list = batch

print(f"Image batch shape: {imgs.shape}")
for i, t in enumerate(targets):
    print(f"Target P{i+3} shape: {t.shape}")
print(f"Objects per image: {[len(b) for b in boxes_list]}")
Streaming real COCO images from Hugging Face...
Image batch shape: torch.Size([4, 3, 640, 640])
Target P3 shape: torch.Size([4, 80, 80, 85])
Target P4 shape: torch.Size([4, 40, 40, 85])
Target P5 shape: torch.Size([4, 20, 20, 85])
Objects per image: [8, 2, 2, 1]
# Visualize real COCO images with ground-truth boxes
cat_names = {i: name for i, name in enumerate(COCO_NAMES)}

fig, axes = plt.subplots(2, 2, figsize=(16, 16))
for i in range(min(4, len(imgs))):
    ax = axes[i // 2][i % 2]
    visualize_sample(imgs[i], boxes_list[i], labels_list[i], cat_names, ax=ax)
    ax.set_title(f'COCO Sample {i} ({len(boxes_list[i])} objects)')

plt.suptitle('Real COCO Images via HF Streaming', fontsize=16)
plt.tight_layout()
plt.show()
[Figure: 2 x 2 grid of streamed COCO samples with ground-truth boxes]
# Visualize target grids for first sample
sample_targets = [t[0] for t in targets]  # first sample in batch
visualize_targets(sample_targets)
[Figure: objectness maps at P3, P4, and P5 for the first sample]

DataLoader performance considerations

When training on real data with thousands of images, DataLoader configuration has a significant impact on GPU utilization:
  • num_workers — set this to the number of CPU cores available (typically 4-8). Each worker runs in a separate process and pre-loads batches in parallel. Setting this too high can cause memory issues.
  • pin_memory=True — enables pinned (page-locked) memory for faster CPU-to-GPU transfers. Always use this when training on a GPU.
  • persistent_workers=True — keeps worker processes alive between epochs, avoiding the overhead of re-spawning them. Requires num_workers > 0.
  • drop_last=True — drops the final incomplete batch so every batch has the same size; a very small last batch can produce noisy batch-normalization statistics and break code that assumes a fixed batch shape.
# Performance configuration for real training
def create_train_loader(annotation_file, image_dir, batch_size=16, num_workers=4):
    """Create an optimized DataLoader for training."""
    dataset = YOLODataset(
        annotation_file, image_dir,
        img_size=IMG_SIZE,
        num_classes=NUM_CLASSES,
        augment=True,
        mosaic_prob=0.5
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,
        pin_memory=True,
        collate_fn=yolo_collate_fn,
        drop_last=True,
        persistent_workers=True if num_workers > 0 else False
    )


print("Data pipeline complete!")
print(f"Input: COCO-format annotations + images")
print(f"Output: {IMG_SIZE}x{IMG_SIZE} images with multi-scale targets")
print(f"  P3: {GRID_SIZES[0]}x{GRID_SIZES[0]} (stride {STRIDES[0]}) - small objects")
print(f"  P4: {GRID_SIZES[1]}x{GRID_SIZES[1]} (stride {STRIDES[1]}) - medium objects")
print(f"  P5: {GRID_SIZES[2]}x{GRID_SIZES[2]} (stride {STRIDES[2]}) - large objects")
Data pipeline complete!
Input: COCO-format annotations + images
Output: 640x640 images with multi-scale targets
  P3: 80x80 (stride 8) - small objects
  P4: 40x40 (stride 16) - medium objects
  P5: 20x20 (stride 32) - large objects
# Streaming alternative: no local files needed
def create_stream_train_loader(split='train', max_samples=None, batch_size=16):
    """Create a DataLoader that streams COCO from Hugging Face.

    Unlike create_train_loader above, this requires no local annotation file
    or image directory. Images are fetched on-the-fly from the HF Hub.
    Mosaic augmentation is not available in streaming mode.
    """
    dataset = COCOStreamYOLODataset(
        split=split,
        max_samples=max_samples,
        img_size=IMG_SIZE,
        num_classes=NUM_CLASSES,
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        collate_fn=yolo_collate_fn,
        num_workers=0,  # streaming is single-threaded
    )


print("Streaming train loader factory ready.")
print("Usage: loader = create_stream_train_loader(max_samples=100)")
Streaming train loader factory ready.
Usage: loader = create_stream_train_loader(max_samples=100)

Summary

In this notebook we built a complete COCO data pipeline for anchor-free YOLOv11 training. The key components are:
  1. COCOParser — reads COCO JSON annotations, maps non-contiguous category IDs to a contiguous range, and groups annotations by image.
  2. Letterbox resize — scales images to 640 x 640 while preserving aspect ratio with symmetric gray padding.
  3. Mosaic augmentation — combines four training images into a single composite to increase object diversity and context variation.
  4. Multi-scale target encoding — assigns each ground-truth box to the appropriate feature pyramid level (P3/P4/P5) and encodes objectness, center offsets, box dimensions, and class labels into dense grid targets.
  5. YOLODataset + DataLoader — wraps everything into a PyTorch Dataset with a custom collate function that handles variable numbers of objects per image.
Next up: In Notebook 2 we will build the YOLOv11 backbone network that processes these 640 x 640 images and produces the P3, P4, and P5 feature maps that our detection heads will operate on.