COCO Data Pipeline for Anchor-Free Detection
Notebook 1 of 5 in the YOLOv11 from-scratch series

Modern YOLO detectors require a specialized data pipeline that goes well beyond simple image loading. The pipeline must handle several responsibilities:
- Parsing COCO-format annotations and mapping non-contiguous category IDs to a contiguous range
- Resizing images via letterboxing to preserve aspect ratio while fitting a fixed input resolution
- Augmenting training data with techniques like mosaic augmentation to increase object diversity per sample
- Encoding ground-truth bounding boxes into multi-scale target tensors suitable for anchor-free detection heads
| Output | Grid Size | Stride | Object Scale |
|---|---|---|---|
| P3 | 80 x 80 | 8 | Small |
| P4 | 40 x 40 | 16 | Medium |
| P5 | 20 x 20 | 32 | Large |
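The grid sizes in the table follow directly from dividing the 640 x 640 input resolution by each level's stride; a quick sanity check:

```python
# Each pyramid level's grid is the input resolution divided by its stride.
input_size = 640
strides = {"P3": 8, "P4": 16, "P5": 32}
grid_sizes = {name: input_size // s for name, s in strides.items()}
print(grid_sizes)  # {'P3': 80, 'P4': 40, 'P5': 20}
```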
The end result is a DataLoader that yields image tensors paired with multi-scale target grids, ready for training.
COCO annotation format
The COCO (Common Objects in Context) dataset uses a JSON annotation format with three top-level keys:
- images — a list of image metadata entries, each containing an id, file_name, width, and height.
- annotations — a list of object annotations. Each annotation links to an image via image_id and contains a bbox in top-left [x, y, width, height] format, a category_id, and an iscrowd flag.
- categories — a list of category definitions mapping id to name.
Note that COCO category IDs are non-contiguous (the 80 classes have IDs scattered across 1-90), so the parser remaps them to a contiguous 0..N-1 range for use in classification targets.
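The parsing and remapping steps can be sketched as follows. This is a minimal version; the function and field names are illustrative, not the notebook's exact COCOParser API:

```python
import json

def parse_coco(annotation_json):
    """Parse a COCO-format annotation dict into per-image records.

    Maps non-contiguous category IDs to a contiguous 0..N-1 range and
    groups annotations by image, skipping iscrowd regions.
    """
    coco = (annotation_json if isinstance(annotation_json, dict)
            else json.load(open(annotation_json)))
    # Sort by original ID so the contiguous mapping is deterministic.
    cats = sorted(coco["categories"], key=lambda c: c["id"])
    cat_map = {c["id"]: i for i, c in enumerate(cats)}
    images = {img["id"]: img for img in coco["images"]}
    anns_by_image = {img_id: [] for img_id in images}
    for ann in coco["annotations"]:
        if ann.get("iscrowd", 0):
            continue  # crowd regions are typically excluded from training targets
        anns_by_image[ann["image_id"]].append(
            {"bbox": ann["bbox"], "label": cat_map[ann["category_id"]]}
        )
    return images, anns_by_image, cat_map

# Tiny in-memory example with non-contiguous category IDs (1 and 3).
sample = {
    "images": [{"id": 7, "file_name": "a.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"image_id": 7, "bbox": [10, 20, 100, 50], "category_id": 3, "iscrowd": 0}
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
}
images, anns, cat_map = parse_coco(sample)
```

Here category IDs 1 and 3 become labels 0 and 1, so classification targets stay dense.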
Letterbox resizing
YOLO models expect a fixed square input (640 x 640). Naively resizing images to this shape would distort their aspect ratio, which can hurt detection accuracy — especially for objects with extreme aspect ratios. Letterboxing solves this by:
- Scaling the image so its longest side matches the target size.
- Padding the shorter side symmetrically with a neutral gray value (114) to form a square.
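A minimal letterbox sketch, using nearest-neighbor index sampling so it needs only NumPy (a real pipeline would typically use OpenCV or PIL for higher-quality interpolation). The returned scale and offsets are what you need to map bounding boxes into the padded frame:

```python
import numpy as np

def letterbox(img, target=640, pad_val=114):
    """Resize an HxWx3 image to target x target, preserving aspect ratio.

    Returns the padded image, the scale factor applied, and the
    (left, top) padding offsets needed to adjust bounding boxes.
    """
    h, w = img.shape[:2]
    scale = min(target / w, target / h)  # longest side ends up == target
    nw, nh = round(w * scale), round(h * scale)
    # Nearest-neighbor resize via index sampling (NumPy only).
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[ys][:, xs]
    # Symmetric gray padding to form a square canvas.
    canvas = np.full((target, target, 3), pad_val, dtype=img.dtype)
    top, left = (target - nh) // 2, (target - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized
    return canvas, scale, (left, top)
```

For example, a 200 x 400 (w x h) image scales by 1.6 to 320 x 640 and gets 160 px of gray padding on the left and right.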
Mosaic augmentation
Mosaic augmentation was introduced in YOLOv4 and remains a staple in modern YOLO training. The idea is simple but powerful: combine four randomly selected training images into a single composite image by placing each in one quadrant. Benefits:
- More objects per sample — the model sees objects from four images in a single forward pass, which improves gradient quality.
- Context diversity — objects appear against varied backgrounds and alongside different neighbors.
- Reduced batch size dependence — because each sample is richer, you can train effectively with smaller batches.
- Scale variation — objects end up at a wider range of scales than they would in isolated images.
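The quadrant-composition step can be sketched as below. This is a simplified version: full YOLO implementations also jitter the mosaic center point and randomly crop, whereas this sketch places each image at a fixed quadrant. The function name and [x1, y1, x2, y2] box format are illustrative:

```python
import numpy as np

def mosaic(images, boxes_list, out_size=640, pad_val=114):
    """Combine four images into a 2x2 mosaic composite.

    images: four HxWx3 uint8 arrays.
    boxes_list: four arrays of [x1, y1, x2, y2] pixel boxes.
    Each image is scaled to fit its quadrant; its boxes are scaled
    and shifted into the composite's coordinate frame.
    """
    half = out_size // 2
    canvas = np.full((out_size, out_size, 3), pad_val, dtype=np.uint8)
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]  # (x, y) per quadrant
    out_boxes = []
    for img, boxes, (ox, oy) in zip(images, boxes_list, offsets):
        h, w = img.shape[:2]
        scale = min(half / w, half / h)
        nw, nh = int(w * scale), int(h * scale)
        # Nearest-neighbor resize via index sampling (avoids an OpenCV dependency).
        ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
        xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
        canvas[oy:oy + nh, ox:ox + nw] = img[ys][:, xs]
        if len(boxes):
            b = boxes * scale
            b[:, [0, 2]] += ox  # shift x coordinates into the quadrant
            b[:, [1, 3]] += oy  # shift y coordinates into the quadrant
            out_boxes.append(b)
    return canvas, (np.concatenate(out_boxes) if out_boxes else np.zeros((0, 4)))
```

Because every box is scaled by up to a factor of two smaller, mosaic naturally produces the extra scale variation mentioned above.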
Multi-scale target encoding
YOLOv11 uses an anchor-free detection paradigm. Instead of pre-defined anchor boxes, each grid cell directly predicts whether it contains an object center and, if so, the bounding box parameters. The target encoding works as follows:
- Scale assignment — each ground-truth box is assigned to the feature pyramid level (P3, P4, or P5) whose receptive field best matches the box size. Small objects (up to 64 px) go to P3, medium objects (65-128 px) to P4, and large objects (129+ px) to P5.
- Grid cell assignment — within the chosen scale, the grid cell that contains the box center is designated as the positive sample.
- Target encoding — at the assigned grid cell, we store:
- Objectness = 1.0 (binary indicator that this cell is responsible for an object)
- Center offsets (cx_offset, cy_offset) — the fractional position of the box center within the grid cell, both in [0, 1]
- Box dimensions (w, h) — normalized by the image size
- Class label — one-hot encoded across the number of classes
Each per-scale target tensor therefore has shape (grid_h, grid_w, 5 + num_classes), where the first 5 channels are [objectness, cx_offset, cy_offset, w, h].
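The three steps above can be sketched in code. This is a minimal version: the 64/128 px thresholds follow the text, boxes are assumed to arrive as pixel-space (cx, cy, w, h) tuples, and the function name is illustrative:

```python
import numpy as np

def encode_targets(boxes, labels, num_classes=80, img_size=640,
                   strides=(8, 16, 32), size_thresholds=(64, 128)):
    """Encode ground-truth boxes into dense per-scale target grids.

    boxes: list of (cx, cy, w, h) in pixels; labels: 0-indexed class IDs.
    Returns one (grid, grid, 5 + num_classes) float32 array per level,
    channels ordered [objectness, cx_offset, cy_offset, w, h, one-hot classes].
    """
    targets = [np.zeros((img_size // s, img_size // s, 5 + num_classes),
                        dtype=np.float32) for s in strides]
    for (cx, cy, w, h), label in zip(boxes, labels):
        # Scale assignment: route the box to P3/P4/P5 by its longest side.
        size = max(w, h)
        level = 0 if size <= size_thresholds[0] else 1 if size <= size_thresholds[1] else 2
        stride = strides[level]
        grid = targets[level]
        # Grid cell assignment: the cell containing the box center is positive.
        gx = min(int(cx / stride), grid.shape[1] - 1)
        gy = min(int(cy / stride), grid.shape[0] - 1)
        grid[gy, gx, 0] = 1.0                  # objectness
        grid[gy, gx, 1] = cx / stride - gx     # fractional center offset in cell
        grid[gy, gx, 2] = cy / stride - gy
        grid[gy, gx, 3] = w / img_size         # dimensions normalized by image size
        grid[gy, gx, 4] = h / img_size
        grid[gy, gx, 5 + label] = 1.0          # one-hot class
    return targets
```

For instance, a 50 x 60 px box centered at (100, 100) lands on P3 (stride 8), in cell (12, 12), with center offsets of 0.5 in each direction.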
Visualization utilities
The following helper functions let us inspect the pipeline output visually. The first function draws bounding boxes on an image tensor, and the second displays the objectness maps at each feature pyramid level.
Loading real COCO data via Hugging Face streaming
Instead of creating synthetic images with colored rectangles, we stream real COCO images directly from detection-datasets/coco on the Hugging Face Hub. This requires no local download — images are fetched on the fly.
Data source: images streamed from detection-datasets/coco. See our HF COCO streaming tutorial for details.
The streaming dataset wraps the HF iterable as a PyTorch IterableDataset, converting annotations from COCO format ([x, y, w, h] with top-left origin) to YOLO format ([cx, cy, w, h] normalized, 0-indexed labels). It applies the same letterbox resize and multi-scale target encoding as the disk-based YOLODataset.
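The COCO-to-YOLO box conversion at the heart of that wrapper can be sketched as a standalone function (the name and signature are illustrative, not the notebook's exact API):

```python
def coco_to_yolo(bbox, img_w, img_h, category_id, cat_map):
    """Convert one COCO box ([x, y, w, h], top-left origin) to YOLO format:
    (0-indexed label, cx, cy, w, h) with coordinates normalized to [0, 1]."""
    x, y, w, h = bbox
    return (
        cat_map[category_id],   # remap to contiguous 0-indexed label
        (x + w / 2) / img_w,    # box center, normalized
        (y + h / 2) / img_h,
        w / img_w,              # box size, normalized
        h / img_h,
    )
```

For a 200 x 100 image, the COCO box [10, 20, 100, 50] becomes center (0.3, 0.45) with normalized size (0.5, 0.5).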
Note: Mosaic augmentation requires random access to the dataset, which is incompatible with IterableDataset. The streaming demo skips mosaic; mosaic augmentation is already demonstrated above with the disk-based YOLODataset.


DataLoader performance considerations
When training on real data with thousands of images, DataLoader configuration has a significant impact on GPU utilization:
- num_workers — set this to the number of CPU cores available (typically 4-8). Each worker runs in a separate process and pre-loads batches in parallel. Setting this too high can cause memory issues.
- pin_memory=True — enables pinned (page-locked) memory for faster CPU-to-GPU transfers. Always use this when training on a GPU.
- persistent_workers=True — keeps worker processes alive between epochs, avoiding the overhead of re-spawning them. Requires num_workers > 0.
- drop_last=True — drops the final incomplete batch; a very small last batch can destabilize batch normalization statistics (and batch size 1 fails outright in training mode).
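Putting these settings together, a typical configuration might look like the following sketch, where dataset and yolo_collate stand in for the YOLODataset and collate function built earlier:

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # assumed: the YOLODataset built earlier
    batch_size=16,
    shuffle=True,
    num_workers=4,            # roughly one per available CPU core
    pin_memory=True,          # page-locked memory speeds CPU-to-GPU copies
    persistent_workers=True,  # keep workers alive across epochs (needs num_workers > 0)
    drop_last=True,           # avoid a tiny, destabilizing final batch
    collate_fn=yolo_collate,  # assumed: custom collate for variable object counts
)
```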
Summary
In this notebook we built a complete COCO data pipeline for anchor-free YOLOv11 training. The key components are:
- COCOParser — reads COCO JSON annotations, maps non-contiguous category IDs to a contiguous range, and groups annotations by image.
- Letterbox resize — scales images to 640 x 640 while preserving aspect ratio with symmetric gray padding.
- Mosaic augmentation — combines four training images into a single composite to increase object diversity and context variation.
- Multi-scale target encoding — assigns each ground-truth box to the appropriate feature pyramid level (P3/P4/P5) and encodes objectness, center offsets, box dimensions, and class labels into dense grid targets.
- YOLODataset + DataLoader — wraps everything into a PyTorch Dataset with a custom collate function that handles variable numbers of objects per image.

