Building the YOLOv11 Backbone
Notebook 2 of 5 in the YOLOv11 from-scratch series.

Introduction
The backbone is the feature extraction engine of any object detection model. In YOLOv11, the backbone extracts hierarchical features at multiple spatial scales, enabling the detector to find objects ranging from small pedestrians to large vehicles in a single forward pass.

Key innovations in the YOLOv11 backbone
- C3k2 block - A Cross Stage Partial (CSP) bottleneck that uses 2 convolutions instead of 3. It splits the input channels, processes one branch through a series of bottleneck blocks, collects intermediate outputs, concatenates everything, and projects back. This is more parameter-efficient than the older C3 block while achieving similar representational power.
- SPPF (Spatial Pyramid Pooling - Fast) - Applies three sequential 5x5 max-pooling operations (equivalent to 5x5, 9x9, and 13x13 receptive fields) to capture multi-scale contextual information without increasing computational cost significantly.
- Multi-scale outputs - The backbone produces three feature maps at different resolutions:
- P3: stride 8 (80x80 for 640x640 input) - fine-grained features for small objects
- P4: stride 16 (40x40) - mid-level features for medium objects
- P5: stride 32 (20x20) - coarse features with large receptive field for large objects
Imports
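The original import cell is not shown in this export; the notebook only needs PyTorch, so a minimal equivalent would be:

```python
# Minimal imports for this notebook (assumes PyTorch is installed).
import torch
import torch.nn as nn
```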
Building blocks: Conv-BN-SiLU
Every convolutional layer in modern YOLO architectures follows the same pattern:
- Convolution (nn.Conv2d) - the learnable spatial filter, with bias=False since batch normalization handles the bias term.
- Batch Normalization (nn.BatchNorm2d) - normalizes activations across the batch, stabilizing training and allowing higher learning rates.
- SiLU activation (also known as Swish: silu(x) = x * sigmoid(x)) - a smooth, non-monotonic activation that consistently outperforms ReLU in detection tasks.
The ConvBNSiLU module bundles these three layers. Its padding parameter defaults to kernel_size // 2, which preserves spatial dimensions for odd kernel sizes (the standard choice).
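The original code cell is not reproduced in this export; one possible implementation consistent with the description above (argument names and defaults are assumptions) is:

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv2d (bias-free) -> BatchNorm2d -> SiLU, the basic YOLO building block."""
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, padding=None):
        super().__init__()
        if padding is None:
            padding = kernel_size // 2  # "same" padding for odd kernel sizes
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Quick check: a 3x3 stride-2 conv halves the spatial size.
x = torch.randn(1, 3, 64, 64)
y = ConvBNSiLU(3, 16, kernel_size=3, stride=2)(x)
```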
Bottleneck block
The Bottleneck is the fundamental processing unit inside CSP blocks. It consists of two convolutions:
- A squeeze convolution that reduces channels by the expansion factor (default 0.5).
- An expand convolution that restores the channel count.
When shortcut=True and the input/output channel counts match, a residual connection adds the input directly to the output. This identity shortcut helps gradients flow through deep networks and has been a cornerstone of modern architectures since ResNet.
The kernel_size parameter accepts a tuple (k1, k2) to independently set the kernel size for each convolution. YOLOv11’s C3k2 block uses (3, 3) by default.
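A sketch of this block, consistent with the description above (the ConvBNSiLU primitive is repeated in compact form so the cell is self-contained; names are assumptions, since the original cell is not shown):

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):  # repeated from the previous section for self-containment
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn, self.act = nn.BatchNorm2d(c_out), nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Squeeze -> expand, with an optional identity shortcut."""
    def __init__(self, c_in, c_out, shortcut=True, expansion=0.5, kernel_size=(3, 3)):
        super().__init__()
        hidden = int(c_out * expansion)           # squeezed channel count
        self.cv1 = ConvBNSiLU(c_in, hidden, kernel_size[0])
        self.cv2 = ConvBNSiLU(hidden, c_out, kernel_size[1])
        self.add = shortcut and c_in == c_out     # residual only if shapes match
    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

x = torch.randn(2, 64, 40, 40)
out = Bottleneck(64, 64)(x)   # shortcut active: input and output channels match
```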
C3k2 block: CSP with 2 convolutions
The C3k2 (Cross Stage Partial with 2 convolutions) block is a key architectural element of YOLOv11. It improves upon earlier CSP designs (C3, C2f) by being more parameter-efficient.

How CSP works
The Cross Stage Partial (CSP) design philosophy is:
- Split: A 1x1 convolution (cv1) projects the input into 2 * hidden_channels, then the output is split (chunked) into two equal halves along the channel dimension.
- Transform: One half passes through a series of n bottleneck blocks. Crucially, each bottleneck’s output is collected (not just the final one), creating a dense connection pattern.
- Concatenate: Both split halves, plus all n bottleneck outputs (a total of 2 + n feature groups), are concatenated along the channel dimension.
- Project: A final 1x1 convolution (cv2) fuses the concatenated features back to the desired output channel count.
The "2 convolutions" in the name refer to the two 1x1 projections (cv1 and cv2), distinguishing C3k2 from C3, which uses three. The default kernel size pair (3, 3) in each bottleneck gives the block its name suffix “k2” (2 kernels of size 3).
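The four steps above can be sketched as follows (ConvBNSiLU and Bottleneck are repeated compactly so the cell runs standalone; exact names and defaults are assumptions):

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):  # repeated for self-containment
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn, self.act = nn.BatchNorm2d(c_out), nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):  # repeated for self-containment
    def __init__(self, c_in, c_out, shortcut=True, e=0.5):
        super().__init__()
        h = int(c_out * e)
        self.cv1, self.cv2 = ConvBNSiLU(c_in, h, 3), ConvBNSiLU(h, c_out, 3)
        self.add = shortcut and c_in == c_out
    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3k2(nn.Module):
    """CSP block: split -> transform (collecting outputs) -> concat -> project."""
    def __init__(self, c_in, c_out, n=2, shortcut=True, expansion=0.5):
        super().__init__()
        hidden = int(c_out * expansion)
        self.cv1 = ConvBNSiLU(c_in, 2 * hidden, 1)           # split projection
        self.cv2 = ConvBNSiLU((2 + n) * hidden, c_out, 1)    # fuse projection
        self.m = nn.ModuleList(Bottleneck(hidden, hidden, shortcut) for _ in range(n))
    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))  # two halves of hidden channels each
        for block in self.m:
            y.append(block(y[-1]))             # collect every bottleneck output
        return self.cv2(torch.cat(y, dim=1))   # 2 + n groups fused to c_out

x = torch.randn(1, 128, 40, 40)
out = C3k2(128, 256, n=2)(x)
```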
SPPF: Spatial Pyramid Pooling - Fast
The SPPF (Spatial Pyramid Pooling - Fast) module addresses a fundamental challenge: how to capture context at multiple spatial scales without drastically increasing computation.

Design
The original SPP module applied max-pooling with three different kernel sizes (5, 9, 13) in parallel. SPPF achieves the same effective receptive fields by applying a single 5x5 max-pool operation three times sequentially:
- After 1 pool: effective receptive field of 5x5
- After 2 pools: effective receptive field of 9x9
- After 3 pools: effective receptive field of 13x13
Pooling with stride=1 and padding=k//2 preserves the spatial dimensions throughout.
This sequential design is faster than parallel pooling because it reuses intermediate results, and it is applied only at the deepest stage of the backbone where feature maps are smallest (20x20).
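A possible implementation of this design (ConvBNSiLU repeated for self-containment; the exact argument names are assumptions):

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):  # repeated for self-containment
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn, self.act = nn.BatchNorm2d(c_out), nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """One 5x5 max-pool applied three times; concat gives 5/9/13 receptive fields."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        hidden = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, hidden, 1)
        self.cv2 = ConvBNSiLU(4 * hidden, c_out, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)  # preserves H x W
    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # 5x5 effective receptive field
        y2 = self.pool(y1)   # 9x9
        y3 = self.pool(y2)   # 13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

x = torch.randn(1, 1024, 20, 20)     # deepest-stage feature map
out = SPPF(1024, 1024)(x)            # spatial size is unchanged
```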
Full backbone assembly
Now we assemble the complete YOLOv11 backbone by stacking the building blocks we have defined. The backbone is organized into a stem followed by four stages, each performing spatial downsampling (stride 2) and feature refinement:

| Component | Operation | Output Shape | Notes |
|---|---|---|---|
| Stem | Conv 3x3, s=2 | 64 x 320 x 320 | Initial feature extraction |
| Stage 1 | Conv 3x3, s=2 + C3k2(n=2) | 128 x 160 x 160 | Low-level features |
| Stage 2 | Conv 3x3, s=2 + C3k2(n=2) | 256 x 80 x 80 | P3 output (stride 8) |
| Stage 3 | Conv 3x3, s=2 + C3k2(n=2) | 512 x 40 x 40 | P4 output (stride 16) |
| Stage 4 | Conv 3x3, s=2 + C3k2(n=2) + SPPF | 1024 x 20 x 20 | P5 output (stride 32) |
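The table above can be assembled into one module as sketched below. This is a compact, self-contained version (all block definitions repeated; the real YOLOv11 additionally scales widths and depths per model size, which is omitted here):

```python
import torch
import torch.nn as nn

# Block definitions repeated from the earlier sections so this cell runs standalone.
class ConvBNSiLU(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn, self.act = nn.BatchNorm2d(c_out), nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    def __init__(self, c_in, c_out, shortcut=True, e=0.5):
        super().__init__()
        h = int(c_out * e)
        self.cv1, self.cv2 = ConvBNSiLU(c_in, h, 3), ConvBNSiLU(h, c_out, 3)
        self.add = shortcut and c_in == c_out
    def forward(self, x):
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y

class C3k2(nn.Module):
    def __init__(self, c_in, c_out, n=2, e=0.5):
        super().__init__()
        h = int(c_out * e)
        self.cv1 = ConvBNSiLU(c_in, 2 * h, 1)
        self.cv2 = ConvBNSiLU((2 + n) * h, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(h, h) for _ in range(n))
    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.m:
            y.append(block(y[-1]))
        return self.cv2(torch.cat(y, dim=1))

class SPPF(nn.Module):
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        h = c_in // 2
        self.cv1, self.cv2 = ConvBNSiLU(c_in, h, 1), ConvBNSiLU(4 * h, c_out, 1)
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

class Backbone(nn.Module):
    """Stem + four stride-2 stages; returns the P3 / P4 / P5 feature maps."""
    def __init__(self):
        super().__init__()
        self.stem = ConvBNSiLU(3, 64, 3, 2)                                   # stride 2
        self.stage1 = nn.Sequential(ConvBNSiLU(64, 128, 3, 2), C3k2(128, 128))   # stride 4
        self.stage2 = nn.Sequential(ConvBNSiLU(128, 256, 3, 2), C3k2(256, 256))  # stride 8  (P3)
        self.stage3 = nn.Sequential(ConvBNSiLU(256, 512, 3, 2), C3k2(512, 512))  # stride 16 (P4)
        self.stage4 = nn.Sequential(ConvBNSiLU(512, 1024, 3, 2),
                                    C3k2(1024, 1024), SPPF(1024, 1024))          # stride 32 (P5)
    def forward(self, x):
        x = self.stage1(self.stem(x))
        p3 = self.stage2(x)
        p4 = self.stage3(p3)
        p5 = self.stage4(p4)
        return p3, p4, p5

model = Backbone().eval()
with torch.no_grad():
    # A 256x256 input keeps this check fast; strides 8/16/32 hold for any size.
    p3, p4, p5 = model(torch.randn(1, 3, 256, 256))
```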
Shape verification
Let us instantiate the backbone and verify that the output feature maps have the expected shapes. This is a critical sanity check: if the shapes are wrong, the downstream neck and head will fail.

Parameter count
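A simple helper for this kind of analysis, shown on a single representative layer (the layer shape is illustrative, not taken from the notebook):

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# A bias-free 3x3 conv from 256 -> 512 channels:
#   weights = 256 * 512 * 3 * 3 = 1,179,648; BatchNorm adds 2 * 512 = 1,024.
# Doubling both channel counts would roughly quadruple the conv's parameters,
# which is why the deepest stages dominate the total.
layer = nn.Sequential(nn.Conv2d(256, 512, 3, bias=False), nn.BatchNorm2d(512))
print(count_params(layer))  # 1180672
```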
Understanding the parameter distribution across stages helps with model analysis and debugging. The later stages hold the bulk of the parameters: each doubling of channel width roughly quadruples a convolution's parameter count.

Feature map visualization
Visualizing the feature maps at each scale gives intuition for what the backbone learns. Even with random weights, we can observe that:
- P3 (80x80) retains fine spatial detail
- P4 (40x40) captures medium-scale structure
- P5 (20x20) shows coarse, high-level patterns
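One minimal way to turn a multi-channel feature map into a viewable heatmap is to average over channels and normalize (a common choice, though not necessarily what the original cell did; the random tensor stands in for a real P3 output):

```python
import torch

def to_heatmap(feature_map):
    """Collapse a (1, C, H, W) feature map to an (H, W) heatmap in [0, 1]."""
    fm = feature_map[0].mean(dim=0)   # average over channels -> (H, W)
    fm = fm - fm.min()
    return fm / (fm.max() + 1e-8)     # normalize for display

p3 = torch.randn(1, 256, 80, 80)      # stand-in for a real P3 feature map
heat = to_heatmap(p3)
# plt.imshow(heat, cmap="viridis")    # then display with matplotlib
```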

Architecture diagram
The following diagram summarizes the complete backbone data flow.

Summary
In this notebook, we built the complete YOLOv11 backbone from scratch. Here is a recap of the key design choices:
- ConvBNSiLU provides a clean, reusable primitive that appears throughout the architecture. Disabling the convolution bias (since batch normalization subsumes it) saves parameters.
- Bottleneck blocks with residual connections enable deeper networks without vanishing gradients. The expansion factor controls the compute/accuracy tradeoff.
- C3k2 (CSP with 2 convolutions) is more parameter-efficient than C3 while maintaining strong feature extraction. The dense connections (collecting all bottleneck outputs) improve gradient flow and feature reuse.
- SPPF captures multi-scale context through sequential max-pooling, enriching the deepest feature map with information from multiple receptive field sizes.
- The multi-scale output design (P3, P4, P5) is essential for detecting objects of varying sizes. This feature pyramid will be further refined in the next notebook.

