
Inference, NMS, and COCO Evaluation

Notebook 5 of 5 in the YOLOv11 From-Scratch Series.

Now that we have a trained model (Notebook 4), we need to turn its raw predictions into usable detections. This notebook covers the complete inference and evaluation pipeline:
  1. Prediction decoding --- converting raw network outputs (class logits and DFL-encoded offsets) into bounding boxes in image coordinates.
  2. Non-Maximum Suppression (NMS) --- a greedy filtering algorithm that removes redundant overlapping detections, keeping only the most confident prediction for each object.
  3. COCO mAP evaluation --- the standardized evaluation protocol that computes Average Precision across multiple IoU thresholds (0.5 to 0.95), providing a single number that captures both localization accuracy and classification performance.
  4. Grad-CAM visualization --- gradient-weighted class activation mapping reveals which spatial regions in the image drive the model’s predictions, offering interpretability into the detection process.
By the end of this notebook, you will have a complete, end-to-end object detection system: from a raw image tensor to evaluated, visualized detections.

Imports

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np
import json, os, time, math
from typing import List, Tuple, Dict, Optional
from collections import defaultdict

from datasets import load_dataset
from PIL import Image

Model components (from Notebooks 2-3)

The following cells re-define the complete YOLOv11 model architecture introduced in Notebooks 2 and 3. They are reproduced here in compact form so that this notebook is fully self-contained. Refer to those notebooks for detailed explanations of each module.
class ConvBNSiLU(nn.Module):
    def __init__(self, in_c, out_c, k=1, s=1, p=None, g=1):
        super().__init__()
        if p is None:
            p = k // 2
        self.conv = nn.Conv2d(in_c, out_c, k, s, p, groups=g, bias=False)
        self.bn = nn.BatchNorm2d(out_c)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    def __init__(self, in_c, out_c, shortcut=True, e=0.5):
        super().__init__()
        mid = int(out_c * e)
        self.cv1 = ConvBNSiLU(in_c, mid, 3)
        self.cv2 = ConvBNSiLU(mid, out_c, 3)
        self.add = shortcut and in_c == out_c

    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y


class C3k2(nn.Module):
    def __init__(self, in_c, out_c, n=2, shortcut=True, e=0.5):
        super().__init__()
        self.c = int(out_c * e)
        self.cv1 = ConvBNSiLU(in_c, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, out_c, 1)
        self.bottlenecks = nn.ModuleList(
            Bottleneck(self.c, self.c, shortcut) for _ in range(n)
        )

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for bn in self.bottlenecks:
            y.append(bn(y[-1]))
        return self.cv2(torch.cat(y, dim=1))


class SPPF(nn.Module):
    def __init__(self, in_c, out_c, k=5):
        super().__init__()
        mid = in_c // 2
        self.cv1 = ConvBNSiLU(in_c, mid, 1)
        self.cv2 = ConvBNSiLU(mid * 4, out_c, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))


class YOLOv11Backbone(nn.Module):
    def __init__(self, in_channels=3, base_channels=64):
        super().__init__()
        c1, c2, c3, c4, c5 = (
            base_channels, base_channels * 2, base_channels * 4,
            base_channels * 8, base_channels * 16,
        )
        self.stem = ConvBNSiLU(in_channels, c1, 3, 2)
        self.stage1 = nn.Sequential(ConvBNSiLU(c1, c2, 3, 2), C3k2(c2, c2, n=2))
        self.stage2 = nn.Sequential(ConvBNSiLU(c2, c3, 3, 2), C3k2(c3, c3, n=2))
        self.stage3 = nn.Sequential(ConvBNSiLU(c3, c4, 3, 2), C3k2(c4, c4, n=2))
        self.stage4 = nn.Sequential(ConvBNSiLU(c4, c5, 3, 2), C3k2(c5, c5, n=2), SPPF(c5, c5))

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        p3 = self.stage2(x)
        p4 = self.stage3(p3)
        p5 = self.stage4(p4)
        return p3, p4, p5


class FPN(nn.Module):
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.lateral_p5 = ConvBNSiLU(c5, c4, 1)
        self.lateral_p4 = ConvBNSiLU(c4, c3, 1)
        self.fpn_p4 = C3k2(c4 + c4, c4, n=2, shortcut=False)
        self.fpn_p3 = C3k2(c3 + c3, c3, n=2, shortcut=False)

    def forward(self, p3, p4, p5):
        p5_up = F.interpolate(self.lateral_p5(p5), size=p4.shape[2:], mode='nearest')
        fpn_p4 = self.fpn_p4(torch.cat([p5_up, p4], dim=1))
        p4_up = F.interpolate(self.lateral_p4(fpn_p4), size=p3.shape[2:], mode='nearest')
        fpn_p3 = self.fpn_p3(torch.cat([p4_up, p3], dim=1))
        return fpn_p3, fpn_p4, p5


class PAN(nn.Module):
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.down_p3 = ConvBNSiLU(c3, c3, 3, 2)
        self.down_p4 = ConvBNSiLU(c4, c4, 3, 2)
        self.pan_p4 = C3k2(c3 + c4, c4, n=2, shortcut=False)
        self.pan_p5 = C3k2(c4 + c5, c5, n=2, shortcut=False)

    def forward(self, fpn_p3, fpn_p4, p5):
        p3_down = self.down_p3(fpn_p3)
        pan_p4 = self.pan_p4(torch.cat([p3_down, fpn_p4], dim=1))
        p4_down = self.down_p4(pan_p4)
        pan_p5 = self.pan_p5(torch.cat([p4_down, p5], dim=1))
        return fpn_p3, pan_p4, pan_p5


class C2PSA(nn.Module):
    def __init__(self, in_channels, out_channels, n=1):
        super().__init__()
        self.c = in_channels // 2
        self.cv1 = ConvBNSiLU(in_channels, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU(2 * self.c, out_channels, 1)
        self.attention = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(self.c, self.c // 4), nn.SiLU(inplace=True),
                nn.Linear(self.c // 4, self.c), nn.Sigmoid()
            ) for _ in range(n)
        ])
        self.bottlenecks = nn.ModuleList(
            [Bottleneck(self.c, self.c, shortcut=True) for _ in range(n)]
        )

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for attn, bn in zip(self.attention, self.bottlenecks):
            feat = bn(y[-1])
            att_weights = attn(feat).unsqueeze(-1).unsqueeze(-1)
            feat = feat * att_weights
            y.append(feat)
        return self.cv2(torch.cat([y[0], y[-1]], dim=1))


class DFLHead(nn.Module):
    def __init__(self, reg_max=16):
        super().__init__()
        self.reg_max = reg_max
        self.register_buffer('project', torch.arange(reg_max, dtype=torch.float32))

    def forward(self, x):
        b, _, h, w = x.shape
        x = x.view(b, 4, self.reg_max, h, w)
        x = F.softmax(x, dim=2)
        x = (x * self.project.view(1, 1, -1, 1, 1)).sum(dim=2)
        return x


class DetectionHead(nn.Module):
    def __init__(self, in_channels, num_classes=80, reg_max=16):
        super().__init__()
        self.num_classes = num_classes
        self.reg_max = reg_max
        self.cls_branch = nn.Sequential(
            ConvBNSiLU(in_channels, in_channels, 3),
            ConvBNSiLU(in_channels, in_channels, 3),
            nn.Conv2d(in_channels, num_classes, 1)
        )
        self.reg_branch = nn.Sequential(
            ConvBNSiLU(in_channels, in_channels, 3),
            ConvBNSiLU(in_channels, in_channels, 3),
            nn.Conv2d(in_channels, 4 * reg_max, 1)
        )
        self.dfl = DFLHead(reg_max)

    def forward(self, x):
        cls_pred = self.cls_branch(x)
        box_raw = self.reg_branch(x)
        box_pred = self.dfl(box_raw)
        return cls_pred, box_pred, box_raw


class YOLOv11(nn.Module):
    def __init__(self, num_classes=80, reg_max=16, base_channels=64):
        super().__init__()
        c3, c4, c5 = base_channels * 4, base_channels * 8, base_channels * 16
        self.backbone = YOLOv11Backbone(base_channels=base_channels)
        self.fpn = FPN(c3, c4, c5)
        self.pan = PAN(c3, c4, c5)
        self.c2psa = C2PSA(c5, c5, n=1)
        self.head_p3 = DetectionHead(c3, num_classes, reg_max)
        self.head_p4 = DetectionHead(c4, num_classes, reg_max)
        self.head_p5 = DetectionHead(c5, num_classes, reg_max)
        self.strides = [8, 16, 32]
        self.num_classes = num_classes
        self.reg_max = reg_max

    def forward(self, x):
        p3, p4, p5 = self.backbone(x)
        fpn_p3, fpn_p4, fpn_p5 = self.fpn(p3, p4, p5)
        pan_p3, pan_p4, pan_p5 = self.pan(fpn_p3, fpn_p4, fpn_p5)
        pan_p5 = self.c2psa(pan_p5)
        pred_p3 = self.head_p3(pan_p3)
        pred_p4 = self.head_p4(pan_p4)
        pred_p5 = self.head_p5(pan_p5)
        return [pred_p3, pred_p4, pred_p5]

Prediction Decoding

The model outputs raw class logits and DFL-decoded LTRB (left, top, right, bottom) offsets at each feature map location. To obtain actual bounding boxes in image coordinates, we need to:
  1. Generate anchor points --- for each cell in each feature map, compute its center in pixel coordinates. A cell at grid position $(i, j)$ in a feature map with stride $s$ corresponds to pixel center $((i + 0.5) \times s, \; (j + 0.5) \times s)$.
  2. Apply stride-aware decoding --- the LTRB offsets predicted by the network are in feature-map units. Multiplying by the stride converts them to pixel distances. The final box coordinates are:
$$x_1 = c_x - l \times s, \quad y_1 = c_y - t \times s, \quad x_2 = c_x + r \times s, \quad y_2 = c_y + b \times s$$
where $(c_x, c_y)$ is the anchor center, $(l, t, r, b)$ are the predicted offsets, and $s$ is the stride.
  3. Filter by confidence --- discard predictions with class probability below a threshold (e.g., 0.25) to reduce the number of candidates before NMS.
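Before the implementation, a quick sanity check on anchor counts: at the standard 640x640 input, the three strides yield 80x80, 40x40, and 20x20 grids, for 8400 candidate anchors in total.

```python
# Grid sizes and anchor counts for a 640x640 input at strides 8/16/32
img_size = 640
for stride in (8, 16, 32):
    side = img_size // stride
    print(f"stride {stride}: {side}x{side} grid, {side * side} anchors")

total = sum((img_size // s) ** 2 for s in (8, 16, 32))
print(f"total anchors: {total}")  # 6400 + 1600 + 400 = 8400
```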
def make_anchor_points(feat_sizes, strides, device='cpu'):
    """Generate anchor center points for all feature levels.
    
    For each feature map cell, computes the center in pixel coordinates.
    A cell at position (i, j) with stride s maps to pixel center
    ((i + 0.5) * s, (j + 0.5) * s).
    
    Args:
        feat_sizes: list of (H, W) for each scale
        strides: list of stride values [8, 16, 32]
        device: target device
    Returns:
        anchor_points: (total_anchors, 2) pixel centers [x, y]
        anchor_strides: (total_anchors,) stride per anchor
    """
    all_points = []
    all_strides_tensor = []
    for (h, w), stride in zip(feat_sizes, strides):
        sy, sx = torch.meshgrid(torch.arange(h, device=device),
                                torch.arange(w, device=device), indexing='ij')
        points = torch.stack([sx.flatten(), sy.flatten()], dim=-1).float()
        points = (points + 0.5) * stride
        all_points.append(points)
        all_strides_tensor.append(torch.full((h * w,), stride, dtype=torch.float32, device=device))
    return torch.cat(all_points), torch.cat(all_strides_tensor)


def decode_predictions(predictions, strides=[8, 16, 32], conf_thresh=0.25, img_size=640):
    """Decode raw model outputs into boxes with scores.
    
    Args:
        predictions: list of (cls_pred, box_pred, box_raw) per scale
        strides: feature strides
        conf_thresh: confidence threshold for filtering
        img_size: input image size for clamping
    Returns:
        batch_results: list of (boxes, scores, class_ids) per image
            boxes: (N, 4) decoded boxes [x1, y1, x2, y2]
            scores: (N,) confidence scores
            class_ids: (N,) predicted class indices
    """
    device = predictions[0][0].device
    feat_sizes = [(p[0].shape[2], p[0].shape[3]) for p in predictions]
    anchor_points, anchor_strides = make_anchor_points(feat_sizes, strides, device)
    
    # Concatenate across scales
    all_cls = torch.cat([p[0].flatten(2).permute(0, 2, 1) for p in predictions], dim=1)
    all_box = torch.cat([p[1].flatten(2).permute(0, 2, 1) for p in predictions], dim=1)
    
    batch_results = []
    batch_size = all_cls.shape[0]
    
    for b in range(batch_size):
        cls_scores = all_cls[b].sigmoid()  # (num_anchors, num_classes)
        box_pred = all_box[b]              # (num_anchors, 4) LTRB
        
        # Decode LTRB to xyxy
        lt = box_pred[:, :2] * anchor_strides.unsqueeze(-1)
        rb = box_pred[:, 2:] * anchor_strides.unsqueeze(-1)
        x1y1 = anchor_points - lt
        x2y2 = anchor_points + rb
        boxes = torch.cat([x1y1, x2y2], dim=-1)
        
        # Get max class score per anchor
        max_scores, class_ids = cls_scores.max(dim=1)
        
        # Filter by confidence
        mask = max_scores > conf_thresh
        boxes = boxes[mask]
        scores = max_scores[mask]
        class_ids = class_ids[mask]
        
        # Clip to image bounds
        boxes = boxes.clamp(0, img_size)
        
        batch_results.append((boxes, scores, class_ids))
    
    return batch_results
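A small numeric check of the anchor and decoding arithmetic, standalone rather than using the functions above:

```python
import torch

# Anchor centers for a 2x2 feature map at stride 8:
# cell (i, j) maps to pixel center ((j + 0.5) * 8, (i + 0.5) * 8)
h, w, stride = 2, 2, 8
sy, sx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
anchors = (torch.stack([sx.flatten(), sy.flatten()], dim=-1).float() + 0.5) * stride
print(anchors.tolist())  # [[4.0, 4.0], [12.0, 4.0], [4.0, 12.0], [12.0, 12.0]]

# Decode one LTRB prediction (in feature-map units) at the first anchor
l, t, r, b = 1.0, 0.5, 2.0, 1.5
cx, cy = anchors[0].tolist()
box = [cx - l * stride, cy - t * stride, cx + r * stride, cy + b * stride]
print(box)  # [-4.0, 0.0, 20.0, 16.0] -- clamped to [0, img_size] downstream
```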

Non-Maximum Suppression

After decoding and confidence filtering, we typically have many overlapping detections for the same object. Non-Maximum Suppression (NMS) is a greedy post-processing algorithm that keeps only the most confident detection for each object instance:
  1. Sort all detections by confidence score (descending).
  2. Select the highest-scoring detection and add it to the output.
  3. Compute IoU between the selected detection and all remaining candidates.
  4. Remove any candidate whose IoU with the selected detection exceeds a threshold (e.g., 0.45)---these are considered redundant detections of the same object.
  5. Repeat from step 2 until no candidates remain.
For multi-class detection, we apply NMS independently per class (“batched NMS”) to avoid suppressing detections of different object categories that happen to overlap spatially.
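To make step 3 concrete, here is the IoU between two toy boxes, worked out standalone rather than with the implementation below:

```python
import torch

# Two boxes, each of area 100, overlapping in a 5x5 corner region
a = torch.tensor([0., 0., 10., 10.])
b = torch.tensor([5., 5., 15., 15.])

inter_wh = (torch.min(a[2:], b[2:]) - torch.max(a[:2], b[:2])).clamp(min=0)
inter = inter_wh.prod().item()          # 5 * 5 = 25
iou = inter / (100 + 100 - inter)       # 25 / 175 ~= 0.143
print(round(iou, 3))  # 0.143 -- below 0.45, so neither box suppresses the other
```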
def nms(boxes, scores, iou_threshold=0.45):
    """Non-Maximum Suppression from scratch.
    
    Args:
        boxes: (N, 4) [x1, y1, x2, y2]
        scores: (N,) confidence scores
        iou_threshold: IoU threshold for suppression
    Returns:
        keep: indices of kept boxes
    """
    if len(boxes) == 0:
        return torch.tensor([], dtype=torch.long)
    
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    
    # Sort by score (descending)
    order = scores.argsort(descending=True)
    
    keep = []
    while len(order) > 0:
        i = order[0].item()
        keep.append(i)
        
        if len(order) == 1:
            break
        
        # Compute IoU with remaining boxes
        remaining = order[1:]
        inter_x1 = torch.max(x1[i], x1[remaining])
        inter_y1 = torch.max(y1[i], y1[remaining])
        inter_x2 = torch.min(x2[i], x2[remaining])
        inter_y2 = torch.min(y2[i], y2[remaining])
        
        inter = (inter_x2 - inter_x1).clamp(0) * (inter_y2 - inter_y1).clamp(0)
        union = areas[i] + areas[remaining] - inter
        iou = inter / (union + 1e-7)
        
        # Keep boxes with IoU below threshold
        mask = iou <= iou_threshold
        order = remaining[mask]
    
    return torch.tensor(keep, dtype=torch.long)


def batched_nms(boxes, scores, class_ids, iou_threshold=0.45):
    """Per-class NMS.
    
    Applies NMS independently for each class to avoid suppressing
    detections of different categories that overlap spatially.
    """
    if len(boxes) == 0:
        return boxes, scores, class_ids
    
    keep_all = []
    for cls in class_ids.unique():
        cls_mask = class_ids == cls
        cls_boxes = boxes[cls_mask]
        cls_scores = scores[cls_mask]
        cls_indices = torch.where(cls_mask)[0]
        
        keep = nms(cls_boxes, cls_scores, iou_threshold)
        keep_all.append(cls_indices[keep])
    
    if keep_all:
        keep_all = torch.cat(keep_all)
        # Sort by score
        sorted_idx = scores[keep_all].argsort(descending=True)
        keep_all = keep_all[sorted_idx]
    else:
        keep_all = torch.tensor([], dtype=torch.long)
    
    return boxes[keep_all], scores[keep_all], class_ids[keep_all]

End-to-End Inference Pipeline

The detect function chains the complete inference pipeline: forward pass through the model, prediction decoding, and per-class NMS. The result is a clean set of bounding boxes, confidence scores, and class labels ready for visualization or evaluation.
@torch.no_grad()
def detect(model, image_tensor, conf_thresh=0.25, iou_thresh=0.45):
    """Full inference pipeline: model -> decode -> NMS -> results.
    
    Args:
        model: YOLOv11 model
        image_tensor: (1, 3, 640, 640) normalized
        conf_thresh: confidence threshold
        iou_thresh: NMS IoU threshold
    Returns:
        boxes: (N, 4) [x1, y1, x2, y2]
        scores: (N,) confidences
        class_ids: (N,) class indices
    """
    model.eval()
    predictions = model(image_tensor)
    batch_results = decode_predictions(predictions, conf_thresh=conf_thresh)
    boxes, scores, class_ids = batch_results[0]
    boxes, scores, class_ids = batched_nms(boxes, scores, class_ids, iou_thresh)
    return boxes, scores, class_ids

Visualization Functions

COCO_NAMES = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck',
              'boat', 'traffic light', 'fire hydrant', 'stop sign', 'parking meter', 'bench',
              'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear', 'zebra',
              'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
              'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove',
              'skateboard', 'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
              'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 'sandwich', 'orange',
              'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
              'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse',
              'remote', 'keyboard', 'cell phone', 'microwave', 'oven', 'toaster', 'sink',
              'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
              'hair drier', 'toothbrush']


def visualize_detections(image_tensor, boxes, scores, class_ids, class_names=None, max_det=50):
    """Draw detection results on image."""
    fig, ax = plt.subplots(1, 1, figsize=(12, 12))
    
    img = image_tensor[0].permute(1, 2, 0).cpu().numpy()
    ax.imshow(img)
    
    colors = plt.cm.Set2(np.linspace(0, 1, 80))
    
    for i in range(min(len(boxes), max_det)):
        x1, y1, x2, y2 = boxes[i].cpu().numpy()
        score = scores[i].cpu().item()
        cls = class_ids[i].cpu().item()
        
        color = colors[cls % len(colors)]
        rect = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=2,
                                  edgecolor=color, facecolor='none')
        ax.add_patch(rect)
        
        name = class_names[cls] if class_names else str(cls)
        label = f'{name}: {score:.2f}'
        ax.text(x1, y1 - 5, label, color='white', fontsize=9,
                bbox=dict(boxstyle='round,pad=0.2', facecolor=color, alpha=0.8))
    
    ax.axis('off')
    ax.set_title(f'Detections: {len(boxes)} objects')
    plt.tight_layout()
    plt.show()

Demo inference on a real COCO image

We run the full detection pipeline on a real COCO validation image streamed from Hugging Face. With random (untrained) weights the detections are meaningless noise --- sigmoid class scores cluster around 0.5, so many boxes can clear a low confidence threshold --- but the pipeline is exercised end-to-end on a real image.
Data source: Images streamed from detection-datasets/coco. See our HF COCO streaming tutorial for details.
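The demo below letterboxes the image: scale so the longer side fits 640, then pad with gray (value 114) to 640x640. The arithmetic, sketched for a hypothetical 480x640 (H x W) source image:

```python
# Letterbox arithmetic for a hypothetical 480x640 (H x W) source image
h_orig, w_orig, size = 480, 640, 640
scale = size / max(h_orig, w_orig)                         # 1.0
new_w, new_h = int(w_orig * scale), int(h_orig * scale)    # 640, 480
pad_w, pad_h = (size - new_w) // 2, (size - new_h) // 2    # 0, 80
print(scale, (new_w, new_h), (pad_w, pad_h))

# Inverse mapping (used later when scoring against original-image GT boxes):
# a padded-image point (x, y) maps back to ((x - pad_w) / scale, (y - pad_h) / scale)
x, y = 100.0, 100.0
print(((x - pad_w) / scale, (y - pad_h) / scale))  # (100.0, 20.0)
```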
# Load a real COCO validation image for demo inference
model = YOLOv11(num_classes=80)
model.eval()

# Stream a single COCO validation image
print("Loading real COCO image from Hugging Face...")
ds = load_dataset('detection-datasets/coco', split='val', streaming=True)
example = next(iter(ds))

img_pil = example['image'].convert('RGB')
img_np = np.array(img_pil)

# Letterbox resize to 640x640
h_orig, w_orig = img_np.shape[:2]
scale = 640 / max(h_orig, w_orig)
new_w, new_h = int(w_orig * scale), int(h_orig * scale)
resized = np.array(img_pil.resize((new_w, new_h)))

padded = np.full((640, 640, 3), 114, dtype=np.uint8)
pad_w = (640 - new_w) // 2
pad_h = (640 - new_h) // 2
padded[pad_h:pad_h + new_h, pad_w:pad_w + new_w] = resized

x = torch.from_numpy(padded).permute(2, 0, 1).float().unsqueeze(0) / 255.0

# Run inference
start = time.time()
boxes, scores, class_ids = detect(model, x, conf_thresh=0.1, iou_thresh=0.45)
elapsed = time.time() - start

print(f"Inference time: {elapsed*1000:.1f} ms")
print(f"Detections after NMS: {len(boxes)}")

if len(boxes) > 0:
    print(f"Top detection: class={COCO_NAMES[class_ids[0]]}, score={scores[0]:.3f}")
    visualize_detections(x, boxes, scores, class_ids, COCO_NAMES)
else:
    print("No detections above threshold (expected with random weights)")
    # Show the input image
    fig, ax = plt.subplots(1, 1, figsize=(10, 10))
    ax.imshow(x[0].permute(1, 2, 0).numpy())
    ax.set_title('Input COCO Image (no detections with random weights)')
    ax.axis('off')
    plt.show()
Loading real COCO image from Hugging Face...
Inference time: 584.9 ms
Detections after NMS: 228
Top detection: class=sports ball, score=0.515
Output from cell 7

COCO mAP Evaluation

The COCO evaluation protocol is the standard benchmark for object detection. Unlike simple accuracy metrics, it captures both localization precision and classification performance in a single number:
  • AP (Average Precision) is computed per class as the area under the precision-recall curve, using 101-point interpolation.
  • AP@IoU evaluates at a specific IoU threshold. AP@0.5 is lenient (50% overlap required), while AP@0.75 is strict.
  • mAP@0.5:0.95 averages AP across 10 IoU thresholds from 0.50 to 0.95 in steps of 0.05, then averages across all classes. This is the primary COCO metric.
The evaluation process:
  1. For each class, sort predictions by confidence score (descending).
  2. Match each prediction to ground-truth boxes using IoU. A prediction is a true positive (TP) if its IoU with an unmatched ground-truth box exceeds the threshold; otherwise it is a false positive (FP).
  3. Compute precision and recall at each confidence level.
  4. Interpolate the precision-recall curve and compute the area under it.
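Steps 1-4 on a toy class: two ground-truth boxes, three predictions whose sorted outcomes are TP, FP, TP. A standalone sketch of the 101-point interpolation:

```python
import numpy as np

# Outcomes of three predictions, already sorted by descending confidence
tp = np.array([1, 0, 1])                 # TP, FP, TP against 2 GT boxes
cum_tp, cum_fp = np.cumsum(tp), np.cumsum(1 - tp)
recalls = cum_tp / 2                     # [0.5, 0.5, 1.0]
precisions = cum_tp / (cum_tp + cum_fp)  # [1.0, 0.5, 0.667]

# 101-point interpolation: max precision at recall >= r, averaged over r
recall_points = np.linspace(0, 1, 101)
interp = [precisions[recalls >= r].max() if (recalls >= r).any() else 0.0
          for r in recall_points]
ap = float(np.mean(interp))
print(f"AP = {ap:.4f}")  # 51 points at precision 1.0, 50 at 2/3 -> 0.8350
```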
class COCOEvaluator:
    """Simplified COCO-style mAP evaluator.
    
    Computes Average Precision at IoU thresholds from 0.5 to 0.95.
    """
    
    def __init__(self, iou_thresholds=None):
        if iou_thresholds is None:
            self.iou_thresholds = np.arange(0.5, 1.0, 0.05)
        else:
            self.iou_thresholds = np.array(iou_thresholds)
        self.predictions = []  # list of dicts
        self.ground_truths = []  # list of dicts
    
    def add_predictions(self, image_id: int, boxes: np.ndarray, scores: np.ndarray, class_ids: np.ndarray):
        """Add predicted detections for one image."""
        for i in range(len(boxes)):
            self.predictions.append({
                'image_id': image_id,
                'bbox': boxes[i],     # [x1, y1, x2, y2]
                'score': float(scores[i]),
                'class_id': int(class_ids[i])
            })
    
    def add_ground_truths(self, image_id: int, boxes: np.ndarray, class_ids: np.ndarray):
        """Add ground-truth annotations for one image."""
        for i in range(len(boxes)):
            self.ground_truths.append({
                'image_id': image_id,
                'bbox': boxes[i],
                'class_id': int(class_ids[i]),
                'matched': set()
            })
    
    def _compute_iou_matrix(self, pred_boxes, gt_boxes):
        """Compute pairwise IoU between prediction and GT boxes."""
        x1 = np.maximum(pred_boxes[:, None, 0], gt_boxes[None, :, 0])
        y1 = np.maximum(pred_boxes[:, None, 1], gt_boxes[None, :, 1])
        x2 = np.minimum(pred_boxes[:, None, 2], gt_boxes[None, :, 2])
        y2 = np.minimum(pred_boxes[:, None, 3], gt_boxes[None, :, 3])
        
        inter = np.maximum(x2 - x1, 0) * np.maximum(y2 - y1, 0)
        area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
        area_gt = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
        union = area_pred[:, None] + area_gt[None, :] - inter
        
        return inter / (union + 1e-7)
    
    def compute_ap(self, precisions, recalls):
        """Compute AP using 101-point interpolation (COCO style)."""
        recall_points = np.linspace(0, 1, 101)
        interpolated = np.zeros_like(recall_points)
        
        for i, r in enumerate(recall_points):
            precs_above = precisions[recalls >= r]
            interpolated[i] = precs_above.max() if len(precs_above) > 0 else 0
        
        return interpolated.mean()
    
    def evaluate(self):
        """Run full COCO-style evaluation.
        
        Returns:
            results: dict with mAP@0.5:0.95, AP@0.5, AP@0.75, and per-threshold APs
        """
        all_classes = set(p['class_id'] for p in self.predictions) | \
                      set(g['class_id'] for g in self.ground_truths)
        
        aps_per_threshold = {t: [] for t in self.iou_thresholds}
        
        for cls in sorted(all_classes):
            cls_preds = [p for p in self.predictions if p['class_id'] == cls]
            cls_gts = [g for g in self.ground_truths if g['class_id'] == cls]
            
            if not cls_gts:
                continue
            
            # Sort predictions by score
            cls_preds.sort(key=lambda x: x['score'], reverse=True)
            
            for iou_thresh in self.iou_thresholds:
                tp = np.zeros(len(cls_preds))
                fp = np.zeros(len(cls_preds))
                matched_gt = set()
                
                pred_boxes = np.array([p['bbox'] for p in cls_preds])
                gt_boxes = np.array([g['bbox'] for g in cls_gts])
                
                if len(pred_boxes) == 0:
                    aps_per_threshold[iou_thresh].append(0.0)
                    continue
                
                iou_matrix = self._compute_iou_matrix(pred_boxes, gt_boxes)
                
                for i in range(len(cls_preds)):
                    img_id = cls_preds[i]['image_id']
                    # Find matching GTs from same image
                    gt_indices = [j for j, g in enumerate(cls_gts) if g['image_id'] == img_id]
                    
                    best_iou = 0
                    best_gt = -1
                    for j in gt_indices:
                        if j not in matched_gt and iou_matrix[i, j] > best_iou:
                            best_iou = iou_matrix[i, j]
                            best_gt = j
                    
                    if best_iou >= iou_thresh and best_gt >= 0:
                        tp[i] = 1
                        matched_gt.add(best_gt)
                    else:
                        fp[i] = 1
                
                cum_tp = np.cumsum(tp)
                cum_fp = np.cumsum(fp)
                recalls = cum_tp / len(cls_gts)
                precisions = cum_tp / (cum_tp + cum_fp + 1e-7)
                
                ap = self.compute_ap(precisions, recalls)
                aps_per_threshold[iou_thresh].append(ap)
        
        results = {}
        for t in self.iou_thresholds:
            if aps_per_threshold[t]:
                results[f'AP@{t:.2f}'] = np.mean(aps_per_threshold[t])
        
        # mAP across all thresholds
        all_aps = [v for v in results.values()]
        results['mAP@0.5:0.95'] = np.mean(all_aps) if all_aps else 0.0
        results['AP@0.5'] = results.get('AP@0.50', 0.0)
        results['AP@0.75'] = results.get('AP@0.75', 0.0)
        
        return results

Evaluation on real COCO images

We evaluate the model on 10 real COCO validation images streamed from Hugging Face. Ground-truth annotations from the dataset serve as reference. With random (untrained) weights, mAP will be near zero — this demonstrates the evaluation pipeline rather than model quality.
# Evaluation with real COCO validation images
evaluator = COCOEvaluator()

print("Streaming 10 COCO validation images for evaluation demo...")
ds = load_dataset('detection-datasets/coco', split='val', streaming=True)

model = YOLOv11(num_classes=80)
model.eval()

for img_id, example in enumerate(ds):
    if img_id >= 10:
        break

    img_pil = example['image'].convert('RGB')
    img_np = np.array(img_pil)
    h_orig, w_orig = img_np.shape[:2]

    # Prepare ground truth (convert COCO [x,y,w,h] to [x1,y1,x2,y2])
    bboxes = example['objects']['bbox']
    cats = example['objects']['category']

    gt_boxes = []
    gt_classes = []
    for bbox, cat_id in zip(bboxes, cats):
        bx, by, bw, bh = bbox
        if bw <= 0 or bh <= 0:
            continue
        gt_boxes.append([bx, by, bx + bw, by + bh])
        gt_classes.append(int(cat_id))

    if len(gt_boxes) == 0:
        continue

    gt_boxes = np.array(gt_boxes, dtype=np.float32)
    gt_classes = np.array(gt_classes, dtype=np.int64)
    evaluator.add_ground_truths(img_id, gt_boxes, gt_classes)

    # Letterbox resize and run inference
    scale = 640 / max(h_orig, w_orig)
    new_w, new_h = int(w_orig * scale), int(h_orig * scale)
    resized = np.array(img_pil.resize((new_w, new_h)))
    padded = np.full((640, 640, 3), 114, dtype=np.uint8)
    pad_w = (640 - new_w) // 2
    pad_h = (640 - new_h) // 2
    padded[pad_h:pad_h + new_h, pad_w:pad_w + new_w] = resized

    img_tensor = torch.from_numpy(padded).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    boxes, scores, class_ids = detect(model, img_tensor, conf_thresh=0.1, iou_thresh=0.45)

    if len(boxes) > 0:
        # Scale boxes back to original image coordinates
        pred_boxes = boxes.cpu().numpy()
        pred_boxes[:, [0, 2]] = (pred_boxes[:, [0, 2]] - pad_w) / scale
        pred_boxes[:, [1, 3]] = (pred_boxes[:, [1, 3]] - pad_h) / scale
        evaluator.add_predictions(img_id, pred_boxes, scores.cpu().numpy(),
                                  class_ids.cpu().numpy())

results = evaluator.evaluate()
print("\n=== COCO-Style Evaluation Results (random weights) ===")
for k, v in sorted(results.items()):
    print(f"  {k}: {v:.4f}")
print("\nNote: Near-zero mAP is expected with random (untrained) weights.")
Streaming 10 COCO validation images for evaluation demo...

=== COCO-Style Evaluation Results (random weights) ===
  AP@0.5: 0.0000
  AP@0.50: 0.0000
  AP@0.55: 0.0000
  AP@0.60: 0.0000
  AP@0.65: 0.0000
  AP@0.70: 0.0000
  AP@0.75: 0.0000
  AP@0.80: 0.0000
  AP@0.85: 0.0000
  AP@0.90: 0.0000
  AP@0.95: 0.0000
  mAP@0.5:0.95: 0.0000

Note: Near-zero mAP is expected with random (untrained) weights.

Grad-CAM: Visualizing Model Attention

Gradient-weighted Class Activation Mapping (Grad-CAM) provides visual explanations for model predictions by highlighting which spatial regions in the input image contribute most to a particular class prediction. The algorithm:
  1. Perform a forward pass and record activations at a target convolutional layer (typically the last layer in the backbone).
  2. Compute the gradient of the target class score with respect to those activations.
  3. Global-average-pool the gradients to obtain per-channel importance weights.
  4. Compute a weighted sum of the activation channels, followed by ReLU to keep only positive contributions.
  5. Upsample the resulting heatmap to the input image size.
For object detection, Grad-CAM reveals whether the model is attending to the correct object regions or relying on spurious contextual cues.
class GradCAM:
    """Gradient-weighted Class Activation Mapping for object detection.
    
    Visualizes which spatial regions in the image contribute most
    to the model's classification decisions.
    """
    
    def __init__(self, model, target_layer):
        self.model = model
        self.target_layer = target_layer
        self.gradients = None
        self.activations = None
        
        # Register hooks
        target_layer.register_forward_hook(self._save_activation)
        target_layer.register_full_backward_hook(self._save_gradient)
    
    def _save_activation(self, module, input, output):
        self.activations = output.detach()
    
    def _save_gradient(self, module, grad_input, grad_output):
        self.gradients = grad_output[0].detach()
    
    def generate(self, input_tensor, class_idx=None):
        """Generate Grad-CAM heatmap.
        
        Args:
            input_tensor: (1, 3, H, W)
            class_idx: target class (None = use predicted class)
        Returns:
            heatmap: (H, W) normalized [0, 1]
        """
        self.model.eval()
        output = self.model(input_tensor)
        
        # Use P3 predictions (highest resolution)
        cls_pred = output[0][0]  # (1, num_classes, H, W)
        
        if class_idx is None:
            # Use the max activation class
            class_idx = cls_pred.sum(dim=(0, 2, 3)).argmax().item()
        
        # Backward for target class
        self.model.zero_grad()
        target = cls_pred[0, class_idx].sum()
        target.backward(retain_graph=True)
        
        if self.gradients is None:
            print("Warning: No gradients captured")
            return np.zeros((input_tensor.shape[2], input_tensor.shape[3]))
        
        # Weight activations by gradients
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)
        cam = (weights * self.activations).sum(dim=1, keepdim=True)
        cam = F.relu(cam)
        
        # Upsample to input size
        cam = F.interpolate(cam, size=input_tensor.shape[2:], mode='bilinear', align_corners=False)
        cam = cam.squeeze().cpu().numpy()
        
        # Normalize
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam


def visualize_gradcam(image_tensor, heatmap, alpha=0.5):
    """Overlay Grad-CAM heatmap on image."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    img = image_tensor[0].permute(1, 2, 0).cpu().numpy()
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)
    
    axes[0].imshow(img)
    axes[0].set_title('Input Image')
    axes[0].axis('off')
    
    axes[1].imshow(heatmap, cmap='jet')
    axes[1].set_title('Grad-CAM Heatmap')
    axes[1].axis('off')
    
    # Overlay
    heatmap_colored = plt.cm.jet(heatmap)[:, :, :3]
    overlay = img * (1 - alpha) + heatmap_colored * alpha
    overlay = np.clip(overlay, 0, 1)
    axes[2].imshow(overlay)
    axes[2].set_title('Overlay')
    axes[2].axis('off')
    
    plt.suptitle('Grad-CAM Visualization', fontsize=14)
    plt.tight_layout()
    plt.show()
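The weighting math in steps 3–5 of `GradCAM.generate` can be exercised on synthetic tensors, independent of any model. The shapes here (8 channels, a 20×20 feature map, a 640×640 input) are illustrative assumptions, not values from the notebook:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
activations = torch.randn(1, 8, 20, 20)  # (B, C, H, W) feature maps
gradients = torch.randn(1, 8, 20, 20)    # same-shaped gradients

# Step 3: global-average-pool gradients into per-channel weights
weights = gradients.mean(dim=(2, 3), keepdim=True)          # (1, 8, 1, 1)

# Step 4: weighted channel sum, ReLU keeps positive contributions
cam = F.relu((weights * activations).sum(dim=1, keepdim=True))  # (1, 1, 20, 20)

# Step 5: upsample to the (assumed) input size and normalize to [0, 1]
cam = F.interpolate(cam, size=(640, 640), mode='bilinear', align_corners=False)
cam = cam.squeeze()
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
print(cam.shape)
```

With random tensors the heatmap is meaningless, but the shapes and the [0, 1] range confirm the pipeline is wired correctly before hooking it to a real layer.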

Grad-CAM Demo

We apply Grad-CAM to the backbone’s SPPF output layer to visualize which spatial regions influence the model’s class predictions. With random weights, the heatmap will appear noisy; with a trained model, it would highlight object-relevant regions.
# Run Grad-CAM on a real COCO image
model = YOLOv11(num_classes=80)
cam = GradCAM(model, model.backbone.stage4[-1].cv2)

# Stream a real COCO image for Grad-CAM visualization
print("Loading real COCO image for Grad-CAM...")
ds = load_dataset('detection-datasets/coco', split='val', streaming=True)
example = next(iter(ds))
img_pil = example['image'].convert('RGB')
img_np = np.array(img_pil)
h_orig, w_orig = img_np.shape[:2]

# Letterbox resize
scale = 640 / max(h_orig, w_orig)
new_w, new_h = int(w_orig * scale), int(h_orig * scale)
resized = np.array(img_pil.resize((new_w, new_h)))
padded = np.full((640, 640, 3), 114, dtype=np.uint8)
pad_w = (640 - new_w) // 2
pad_h = (640 - new_h) // 2
padded[pad_h:pad_h + new_h, pad_w:pad_w + new_w] = resized

x = torch.from_numpy(padded).permute(2, 0, 1).float().unsqueeze(0) / 255.0
x.requires_grad_(True)

heatmap = cam.generate(x)
visualize_gradcam(x.detach(), heatmap)
print("Grad-CAM shows which regions influence classification on a real COCO image")
Loading real COCO image for Grad-CAM...
[Figure: Grad-CAM visualization showing the input image, the heatmap, and the overlay]
Grad-CAM shows which regions influence classification on a real COCO image

Performance Benchmarking

We measure the end-to-end latency of the inference pipeline (forward pass + decoding + NMS) to establish a CPU baseline. GPU inference would be significantly faster.
model = YOLOv11(num_classes=80)
model.eval()

# Use a real COCO image for benchmarking
print("Loading real COCO image for benchmark...")
ds = load_dataset('detection-datasets/coco', split='val', streaming=True)
example = next(iter(ds))
img_pil = example['image'].convert('RGB')
img_np = np.array(img_pil)
h_orig, w_orig = img_np.shape[:2]

scale = 640 / max(h_orig, w_orig)
new_w, new_h = int(w_orig * scale), int(h_orig * scale)
resized = np.array(img_pil.resize((new_w, new_h)))
padded = np.full((640, 640, 3), 114, dtype=np.uint8)
pad_w = (640 - new_w) // 2
pad_h = (640 - new_h) // 2
padded[pad_h:pad_h + new_h, pad_w:pad_w + new_w] = resized

x = torch.from_numpy(padded).permute(2, 0, 1).float().unsqueeze(0) / 255.0

# Warm up
for _ in range(3):
    with torch.no_grad():
        _ = model(x)

# Benchmark
times = []
for _ in range(10):
    start = time.time()
    with torch.no_grad():
        predictions = model(x)
        boxes, scores, cids = decode_predictions(predictions, conf_thresh=0.25)[0]
        if len(boxes) > 0:
            boxes, scores, cids = batched_nms(boxes, scores, cids, 0.45)
    times.append(time.time() - start)

avg_time = np.mean(times) * 1000
print(f"Average inference time (CPU): {avg_time:.1f} ms")
print(f"FPS (CPU): {1000/avg_time:.1f}")
print("Note: GPU inference would be significantly faster")
Loading real COCO image for benchmark...
Average inference time (CPU): 384.7 ms
FPS (CPU): 2.6
Note: GPU inference would be significantly faster
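Mean latency can hide tail behavior, so a common refinement is to report percentiles with `time.perf_counter` (which has higher resolution than `time.time`). This sketch times a stand-in workload (a small matmul, since the model itself is heavy) purely to illustrate the reporting pattern:

```python
import time
import numpy as np

def benchmark(fn, warmup=3, iters=20):
    """Return per-iteration latencies in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are discarded, as in the cell above
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000)
    return np.array(times)

a = np.random.rand(256, 256)
lat = benchmark(lambda: a @ a)
print(f"p50: {np.percentile(lat, 50):.2f} ms  p95: {np.percentile(lat, 95):.2f} ms")
```

Swapping the lambda for the full forward + decode + NMS pipeline would give p50/p95 latencies for the detector itself.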

Summary

This notebook completed the inference and evaluation pipeline for our from-scratch YOLOv11 implementation:
  • Prediction decoding converts raw network outputs (class logits and DFL-encoded LTRB offsets) into bounding boxes in image coordinates through stride-aware anchor point generation.
  • Non-Maximum Suppression (NMS) removes redundant overlapping detections using a greedy algorithm that keeps the highest-confidence prediction for each object, applied independently per class.
  • COCO mAP evaluation provides standardized metrics by computing Average Precision across multiple IoU thresholds (0.50 to 0.95), capturing both localization accuracy and classification performance.
  • Grad-CAM visualization offers model interpretability by highlighting which spatial regions drive the network’s predictions.
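The AP computation summarized above can be illustrated with a handful of hypothetical detections at a single IoU threshold: sort by confidence, accumulate precision and recall, then integrate the monotone precision envelope (the "all-points" interpolation). The match flags and ground-truth count below are made up for the example:

```python
import numpy as np

# Detections already sorted by confidence; True = matched a ground truth (TP)
matches = np.array([True, True, False, True, False])
num_gt = 4

tp = np.cumsum(matches)
fp = np.cumsum(~matches)
recall = tp / num_gt
precision = tp / (tp + fp)

# Monotone (non-increasing) precision envelope, then area under the PR curve
prec_env = np.maximum.accumulate(precision[::-1])[::-1]
recall_pts = np.concatenate([[0.0], recall])
ap = np.sum((recall_pts[1:] - recall_pts[:-1]) * prec_env)
print(f"AP = {ap:.4f}")  # 0.6875 for this toy example
```

COCO mAP repeats this calculation at IoU thresholds 0.50 through 0.95 (step 0.05) and averages the results over thresholds and classes.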

Series recap

This concludes the 5-notebook series building YOLOv11 from scratch:
  1. Notebook 1 --- COCO data loading and augmentation pipeline.
  2. Notebook 2 --- Backbone architecture: ConvBNSiLU, C3k2 (CSP bottleneck), SPPF for multi-scale feature extraction.
  3. Notebook 3 --- Neck and head: FPN (top-down) + PAN (bottom-up) feature fusion, C2PSA attention, and DFL-based decoupled detection heads.
  4. Notebook 4 --- Loss functions and training: Task-Aligned Label Assignment, BCE classification loss, CIoU + DFL regression loss, and the training loop.
  5. Notebook 5 --- Inference, NMS, COCO evaluation, and Grad-CAM (this notebook).
Key architectural takeaways:
  • Anchor-free detection eliminates the need for hand-designed anchor priors.
  • C3k2 blocks with cross-stage partial connections enable efficient feature reuse.
  • SPPF provides multi-scale receptive fields with minimal computational overhead.
  • Bidirectional FPN+PAN neck ensures both semantic and localization information flows across all scales.
  • DFL (Distribution Focal Loss) models boundary uncertainty through discrete distributions, improving localization precision.
  • Task-Aligned Assignment dynamically matches predictions to ground truths based on both classification and localization quality.
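The DFL bullet above can be made concrete with a minimal decoding sketch: the head predicts a discrete distribution over integer offsets for each box edge, and the edge is recovered as the distribution's expected value. `reg_max=16` is an assumption here, matching common YOLO configurations:

```python
import torch

reg_max = 16
logits = torch.zeros(reg_max + 1)
logits[3], logits[4] = 10.0, 10.0  # mass concentrated between bins 3 and 4

probs = torch.softmax(logits, dim=0)            # discrete distribution over bins
bins = torch.arange(reg_max + 1, dtype=torch.float32)
offset = (probs * bins).sum()                   # expectation over the bins
print(float(offset))  # close to 3.5: the edge lies between the two peaked bins
```

A sharply peaked distribution expresses a confident boundary; spreading mass across neighboring bins lets the model represent boundary uncertainty, which is exactly what the DFL loss trains it to do.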