
Building the YOLOv11 Backbone

Notebook 2 of 5 in the YOLOv11 from-scratch series

Introduction

The backbone is the feature extraction engine of any object detection model. In YOLOv11, the backbone extracts hierarchical features at multiple spatial scales, enabling the detector to find objects ranging from small pedestrians to large vehicles in a single forward pass.

Key innovations in the YOLOv11 backbone

  1. C3k2 block - A Cross Stage Partial (CSP) bottleneck that uses 2 convolutions instead of 3. It splits the input channels, processes one branch through a series of bottleneck blocks, collects intermediate outputs, concatenates everything, and projects back. This is more parameter-efficient than the older C3 block while achieving similar representational power.
  2. SPPF (Spatial Pyramid Pooling - Fast) - Applies three sequential 5x5 max-pooling operations (equivalent to 5x5, 9x9, and 13x13 receptive fields) to capture multi-scale contextual information without increasing computational cost significantly.
  3. Multi-scale outputs - The backbone produces three feature maps at different resolutions:
    • P3: stride 8 (80x80 for 640x640 input) - fine-grained features for small objects
    • P4: stride 16 (40x40) - mid-level features for medium objects
    • P5: stride 32 (20x20) - coarse features with large receptive field for large objects
By the end of this notebook, you will have a fully functional YOLOv11 backbone implemented from scratch in PyTorch.
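The stride-to-resolution relationship above is just integer division of the input size; a quick sketch to confirm the grid sizes:

```python
# Grid size at each pyramid level = input size // stride
input_size = 640
for name, stride in [("P3", 8), ("P4", 16), ("P5", 32)]:
    grid = input_size // stride
    print(f"{name}: {grid} x {grid}")
```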

Imports

import torch
import torch.nn as nn
import torch.nn.functional as F
import matplotlib.pyplot as plt
import numpy as np
from typing import List, Tuple

Building blocks: Conv-BN-SiLU

Every convolutional layer in modern YOLO architectures follows the same pattern:
  1. Convolution (nn.Conv2d) - the learnable spatial filter, with bias=False since batch normalization handles the bias term.
  2. Batch Normalization (nn.BatchNorm2d) - normalizes activations across the batch, stabilizing training and allowing higher learning rates.
  3. SiLU activation (also known as Swish: f(x) = x · σ(x)) - a smooth, non-monotonic activation that tends to outperform ReLU in detection backbones.
This pattern is so pervasive that we encapsulate it in a single ConvBNSiLU module. The padding parameter defaults to kernel_size // 2, which preserves spatial dimensions for odd kernel sizes (the standard choice).
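Before wrapping it in a module, we can verify numerically that PyTorch's `nn.SiLU` matches the formula x · σ(x):

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
silu = nn.SiLU()
# SiLU(x) = x * sigmoid(x)
assert torch.allclose(silu(x), x * torch.sigmoid(x))
```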
class ConvBNSiLU(nn.Module):
    """Standard Conv + BatchNorm + SiLU (Swish) activation block."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 1,
                 stride: int = 1, padding: int = None, groups: int = 1):
        super().__init__()
        if padding is None:
            padding = kernel_size // 2
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size, stride,
                              padding, groups=groups, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
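A quick check (independent of the class above, using plain `nn.Conv2d`) that the default padding rule `kernel_size // 2` preserves spatial dimensions for odd kernels:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
for k in (1, 3, 5, 7):
    conv = nn.Conv2d(8, 8, k, stride=1, padding=k // 2, bias=False)
    # odd k with padding k // 2 keeps the 32x32 spatial size
    assert conv(x).shape == x.shape
print("spatial dimensions preserved for k = 1, 3, 5, 7")
```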

Bottleneck block

The Bottleneck is the fundamental processing unit inside CSP blocks. It consists of two convolutions:
  1. A squeeze convolution that reduces channels by the expansion factor (default 0.5).
  2. An expand convolution that restores the channel count.
When shortcut=True and the input/output channel counts match, a residual connection adds the input directly to the output. This identity shortcut helps gradients flow through deep networks and has been a cornerstone of modern architectures since ResNet. The kernel_size parameter accepts a tuple (k1, k2) to independently set the kernel size for each convolution. YOLOv11’s C3k2 block uses (3, 3) by default.
class Bottleneck(nn.Module):
    """Standard bottleneck with optional residual connection."""

    def __init__(self, in_channels: int, out_channels: int, shortcut: bool = True,
                 kernel_size: Tuple[int, int] = (3, 3), expansion: float = 0.5):
        super().__init__()
        hidden = int(out_channels * expansion)
        self.cv1 = ConvBNSiLU(in_channels, hidden, kernel_size[0])
        self.cv2 = ConvBNSiLU(hidden, out_channels, kernel_size[1])
        self.add = shortcut and in_channels == out_channels

    def forward(self, x):
        return x + self.cv2(self.cv1(x)) if self.add else self.cv2(self.cv1(x))

C3k2 block: CSP with 2 convolutions

The C3k2 (Cross Stage Partial with 2 convolutions) block is a key architectural element of YOLOv11. It improves upon earlier CSP designs (C3, C2f) by being more parameter-efficient.

How CSP works

The Cross Stage Partial (CSP) design philosophy is:
  1. Split: A 1x1 convolution (cv1) projects the input into 2 * hidden_channels, then the output is split (chunked) into two equal halves along the channel dimension.
  2. Transform: One half passes through a series of n bottleneck blocks. Crucially, each bottleneck’s output is collected (not just the final one), creating a dense connection pattern.
  3. Concatenate: The original split half, plus all n bottleneck outputs (total of 2 + n feature groups), are concatenated along the channel dimension.
  4. Project: A final 1x1 convolution (cv2) fuses the concatenated features back to the desired output channel count.
The “2 convolutions” in C3k2 refers to the two projection convolutions (cv1 and cv2), distinguishing it from C3, which uses three. In the reference Ultralytics implementation, the “k” indicates that the inner blocks accept a configurable kernel size; our simplified variant fixes the bottleneck kernels at (3, 3).
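The channel bookkeeping in steps 1-4 can be traced with plain arithmetic. Using Stage 2's numbers (out_channels=256, n=2) as an example:

```python
out_channels, n, expansion = 256, 2, 0.5
c = int(out_channels * expansion)   # hidden channels per split half -> 128
cv1_out = 2 * c                     # cv1 projects input to two halves  -> 256
cat_in = (2 + n) * c                # 2 halves + n bottleneck outputs   -> 512
print(c, cv1_out, cat_in)
```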
class C3k2(nn.Module):
    """CSP Bottleneck with 2 convolutions (YOLOv11 variant).

    Splits input channels, processes one part through bottleneck blocks,
    concatenates, and projects back. More efficient than C3 with similar performance.
    """

    def __init__(self, in_channels: int, out_channels: int, n: int = 1,
                 shortcut: bool = True, expansion: float = 0.5):
        super().__init__()
        self.c = int(out_channels * expansion)  # hidden channels
        self.cv1 = ConvBNSiLU(in_channels, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, out_channels, 1)
        self.bottlenecks = nn.ModuleList(
            Bottleneck(self.c, self.c, shortcut, kernel_size=(3, 3), expansion=1.0)
            for _ in range(n)
        )

    def forward(self, x):
        # Split into two branches
        y = list(self.cv1(x).chunk(2, dim=1))
        # Pass through sequential bottlenecks, collecting outputs
        for bn in self.bottlenecks:
            y.append(bn(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

SPPF: Spatial Pyramid Pooling - Fast

The SPPF (Spatial Pyramid Pooling - Fast) module addresses a fundamental challenge: how to capture context at multiple spatial scales without drastically increasing computation.

Design

The original SPP module applied max-pooling with three different kernel sizes (5, 9, 13) in parallel. SPPF achieves the same effective receptive fields by applying a single 5x5 max-pool operation three times sequentially:
  • After 1 pool: effective receptive field of 5x5
  • After 2 pools: effective receptive field of 9x9
  • After 3 pools: effective receptive field of 13x13
The four feature maps (original + 3 pooled versions) are concatenated and projected through a 1x1 convolution. Using stride=1 and padding=k//2 preserves the spatial dimensions throughout. This sequential design is faster than parallel pooling because it reuses intermediate results, and it is applied only at the deepest stage of the backbone where feature maps are smallest (20x20).
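The claimed equivalence between sequential and parallel pooling can be verified empirically: with stride 1 and matching padding (PyTorch max-pooling pads implicitly with negative infinity), composing 5x5 max-pools reproduces single larger pools exactly.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 4, 20, 20)
p5 = nn.MaxPool2d(5, stride=1, padding=2)
p9 = nn.MaxPool2d(9, stride=1, padding=4)
p13 = nn.MaxPool2d(13, stride=1, padding=6)

# two 5x5 pools == one 9x9 pool; three 5x5 pools == one 13x13 pool
assert torch.equal(p5(p5(x)), p9(x))
assert torch.equal(p5(p5(p5(x))), p13(x))
print("sequential 5x5 pooling matches 9x9 and 13x13 receptive fields")
```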
class SPPF(nn.Module):
    """Spatial Pyramid Pooling - Fast (SPPF).

    Three sequential 5x5 max-pool operations (equivalent to 5x5, 9x9, 13x13 pooling)
    capture multi-scale context efficiently.
    """

    def __init__(self, in_channels: int, out_channels: int, k: int = 5):
        super().__init__()
        hidden = in_channels // 2
        self.cv1 = ConvBNSiLU(in_channels, hidden, 1)
        self.cv2 = ConvBNSiLU(hidden * 4, out_channels, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

Full backbone assembly

Now we assemble the complete YOLOv11 backbone by stacking the building blocks we have defined. The backbone is organized into a stem followed by four stages, each performing spatial downsampling (stride 2) and feature refinement:
| Component | Operation | Output Shape | Notes |
|---|---|---|---|
| Stem | Conv 3x3, s=2 | 64 x 320 x 320 | Initial feature extraction |
| Stage 1 | Conv 3x3, s=2 + C3k2(n=2) | 128 x 160 x 160 | Low-level features |
| Stage 2 | Conv 3x3, s=2 + C3k2(n=2) | 256 x 80 x 80 | P3 output (stride 8) |
| Stage 3 | Conv 3x3, s=2 + C3k2(n=2) | 512 x 40 x 40 | P4 output (stride 16) |
| Stage 4 | Conv 3x3, s=2 + C3k2(n=2) + SPPF | 1024 x 20 x 20 | P5 output (stride 32) |
The three outputs (P3, P4, P5) form a feature pyramid that will be further refined by the neck (FPN/PAN) in the next notebook. Small objects are detected at P3 (high resolution, low-level features), while large objects are detected at P5 (low resolution, high-level semantic features).
class YOLOv11Backbone(nn.Module):
    """YOLOv11 backbone producing P3, P4, P5 feature maps.

    Architecture:
        Stem (3->64) -> Stage1 (64->128) -> Stage2 (128->256, P3)
        -> Stage3 (256->512, P4) -> Stage4 (512->1024) -> SPPF (P5)
    """

    def __init__(self, in_channels: int = 3, base_channels: int = 64):
        super().__init__()
        c1 = base_channels       # 64
        c2 = c1 * 2              # 128
        c3 = c2 * 2              # 256
        c4 = c3 * 2              # 512
        c5 = c4 * 2              # 1024

        # Stem
        self.stem = ConvBNSiLU(in_channels, c1, 3, stride=2)

        # Stage 1: downsample + C3k2
        self.stage1_down = ConvBNSiLU(c1, c2, 3, stride=2)
        self.stage1_c3k2 = C3k2(c2, c2, n=2, shortcut=True)

        # Stage 2: downsample + C3k2 -> P3 output
        self.stage2_down = ConvBNSiLU(c2, c3, 3, stride=2)
        self.stage2_c3k2 = C3k2(c3, c3, n=2, shortcut=True)

        # Stage 3: downsample + C3k2 -> P4 output
        self.stage3_down = ConvBNSiLU(c3, c4, 3, stride=2)
        self.stage3_c3k2 = C3k2(c4, c4, n=2, shortcut=True)

        # Stage 4: downsample + C3k2 + SPPF -> P5 output
        self.stage4_down = ConvBNSiLU(c4, c5, 3, stride=2)
        self.stage4_c3k2 = C3k2(c5, c5, n=2, shortcut=True)
        self.sppf = SPPF(c5, c5)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        """Forward pass returning multi-scale features.

        Args:
            x: (B, 3, 640, 640) input images
        Returns:
            p3: (B, 256, 80, 80) - stride 8
            p4: (B, 512, 40, 40) - stride 16
            p5: (B, 1024, 20, 20) - stride 32
        """
        # Stem: 640 -> 320
        x = self.stem(x)

        # Stage 1: 320 -> 160
        x = self.stage1_c3k2(self.stage1_down(x))

        # Stage 2: 160 -> 80 (P3)
        x = self.stage2_c3k2(self.stage2_down(x))
        p3 = x  # 256 channels, 80x80

        # Stage 3: 80 -> 40 (P4)
        x = self.stage3_c3k2(self.stage3_down(x))
        p4 = x  # 512 channels, 40x40

        # Stage 4: 40 -> 20 (P5)
        x = self.stage4_c3k2(self.stage4_down(x))
        p5 = self.sppf(x)  # 1024 channels, 20x20

        return p3, p4, p5

Shape verification

Let us instantiate the backbone and verify that the output feature maps have the expected shapes. This is a critical sanity check: if the shapes are wrong, the downstream neck and head will fail.
# Verify output shapes
backbone = YOLOv11Backbone()
dummy_input = torch.randn(1, 3, 640, 640)

with torch.no_grad():
    p3, p4, p5 = backbone(dummy_input)

print("Input shape:", dummy_input.shape)
print(f"P3 shape: {p3.shape}  (stride 8,  {p3.shape[1]} channels)")
print(f"P4 shape: {p4.shape}  (stride 16, {p4.shape[1]} channels)")
print(f"P5 shape: {p5.shape}  (stride 32, {p5.shape[1]} channels)")

# Verify spatial dimensions
assert p3.shape == (1, 256, 80, 80), f"P3 expected (1, 256, 80, 80), got {p3.shape}"
assert p4.shape == (1, 512, 40, 40), f"P4 expected (1, 512, 40, 40), got {p4.shape}"
assert p5.shape == (1, 1024, 20, 20), f"P5 expected (1, 1024, 20, 20), got {p5.shape}"
print("\nAll shape checks passed!")
Input shape: torch.Size([1, 3, 640, 640])
P3 shape: torch.Size([1, 256, 80, 80])  (stride 8,  256 channels)
P4 shape: torch.Size([1, 512, 40, 40])  (stride 16, 512 channels)
P5 shape: torch.Size([1, 1024, 20, 20])  (stride 32, 1024 channels)

All shape checks passed!

Parameter count

Understanding the parameter distribution across stages helps with model analysis and debugging. The later stages have exponentially more parameters due to the doubling of channel widths.
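The per-layer counts below follow directly from the convolution parameter formula: k·k·c_in·c_out for a bias-free conv, plus 2·c_out for BatchNorm's affine weight and bias. For example, the stem (3x3 conv, 3 -> 64 channels):

```python
k, c_in, c_out = 3, 3, 64
conv_params = k * k * c_in * c_out   # 1728
bn_params = 2 * c_out                # 128 (BatchNorm weight + bias)
print(conv_params + bn_params)       # 1856, matching the stem entry in the breakdown
```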
def count_parameters(model):
    """Count trainable and total parameters."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Total parameters: {total:,}")
    print(f"Trainable parameters: {trainable:,}")
    print(f"Size (MB): {total * 4 / 1024 / 1024:.1f}")  # assuming float32 (4 bytes/param)
    return total

# Per-stage breakdown
print("=== Parameter Breakdown ===")
for name, module in backbone.named_children():
    params = sum(p.numel() for p in module.parameters())
    print(f"  {name}: {params:,}")
print()
count_parameters(backbone)
=== Parameter Breakdown ===
  stem: 1,856
  stage1_down: 73,984
  stage1_c3k2: 197,632
  stage2_down: 295,424
  stage2_c3k2: 788,480
  stage3_down: 1,180,672
  stage3_c3k2: 3,149,824
  stage4_down: 4,720,640
  stage4_c3k2: 12,591,104
  sppf: 2,624,512

Total parameters: 25,624,128
Trainable parameters: 25,624,128
Size (MB): 97.7
25624128

Feature map visualization

Visualizing the feature maps at each scale gives intuition for what the backbone learns. Even with random weights, we can observe that:
  • P3 (80x80) retains fine spatial detail
  • P4 (40x40) captures medium-scale structure
  • P5 (20x20) shows coarse, high-level patterns
def visualize_feature_maps(features, names, num_channels=8):
    """Visualize the first few channels of each feature map."""
    fig, axes = plt.subplots(len(features), num_channels, figsize=(20, 3 * len(features)))

    for i, (feat, name) in enumerate(zip(features, names)):
        feat_np = feat[0].detach().cpu().numpy()  # remove batch dim
        for j in range(min(num_channels, feat_np.shape[0])):
            ax = axes[i, j] if len(features) > 1 else axes[j]
            ax.imshow(feat_np[j], cmap='viridis')
            # Hide ticks but keep the axes frame, so the row label stays visible
            # (axis('off') would also hide the ylabel set below)
            ax.set_xticks([])
            ax.set_yticks([])
            if j == 0:
                ax.set_ylabel(name, fontsize=12, rotation=0, labelpad=60)

    plt.suptitle('Feature Map Activations (first 8 channels per scale)', fontsize=14)
    plt.tight_layout()
    plt.show()

# Generate features from a random input for visualization
with torch.no_grad():
    x = torch.randn(1, 3, 640, 640)  # even with random input and untrained weights,
    p3, p4, p5 = backbone(x)         # the three scales show distinct levels of detail

visualize_feature_maps([p3, p4, p5], ['P3 (80x80)', 'P4 (40x40)', 'P5 (20x20)'])
[Figure: grids of the first 8 channel activations for P3 (80x80), P4 (40x40), and P5 (20x20)]

Architecture diagram

The following diagram summarizes the complete backbone data flow:
Input (3x640x640)
      |
   +------+
   | Stem |  Conv 3x3, s=2
   +------+  -> 64x320x320
      |
   +------+
   |  S1  |  Conv 3x3, s=2 -> C3k2(n=2)
   +------+  -> 128x160x160
      |
   +------+
   |  S2  |  Conv 3x3, s=2 -> C3k2(n=2)
   +------+  -> 256x80x80 --------> P3
      |
   +------+
   |  S3  |  Conv 3x3, s=2 -> C3k2(n=2)
   +------+  -> 512x40x40 --------> P4
      |
   +------+
   |  S4  |  Conv 3x3, s=2 -> C3k2(n=2) -> SPPF
   +------+  -> 1024x20x20 -------> P5
Each stage doubles the channel count while halving the spatial resolution. The SPPF module is applied only at the deepest level where the computational overhead is minimal but the benefit of multi-scale pooling is greatest.

Summary

In this notebook, we built the complete YOLOv11 backbone from scratch. Here is a recap of the key design choices:
  1. ConvBNSiLU provides a clean, reusable primitive that appears throughout the architecture. Disabling the convolution bias (since batch normalization subsumes it) saves parameters.
  2. Bottleneck blocks with residual connections enable deeper networks without vanishing gradients. The expansion factor controls the compute/accuracy tradeoff.
  3. C3k2 (CSP with 2 convolutions) is more parameter-efficient than C3 while maintaining strong feature extraction. The dense connections (collecting all bottleneck outputs) improve gradient flow and feature reuse.
  4. SPPF captures multi-scale context through sequential max-pooling, enriching the deepest feature map with information from multiple receptive field sizes.
  5. The multi-scale output design (P3, P4, P5) is essential for detecting objects of varying sizes. This feature pyramid will be further refined in the next notebook.

Next steps

In Notebook 3, we will build the FPN/PAN neck that fuses these multi-scale features bidirectionally, and the detection head that produces bounding box predictions and class scores at each scale.