Feature Aggregation and Anchor-Free Detection Head
Notebook 3 of 5 in the YOLOv11 From-Scratch Series

In this notebook we build the neck and detection head of YOLOv11. The backbone (Notebook 2) produces multi-scale feature maps P3, P4, and P5, but these features are not yet ready for detection:
- Deep features (P5) have strong semantics but poor spatial resolution.
- Shallow features (P3) have fine spatial detail but weak semantics.
The neck bridges this gap with three components:
- FPN (Feature Pyramid Network) --- top-down pathway that propagates high-level semantic information to lower-level features.
- PAN (Path Aggregation Network) --- bottom-up pathway that propagates strong localization signals back up.
- C2PSA (Channel Attention) --- lightweight partial self-attention for feature refinement.

The detection head then produces, at each scale:
- Classification logits --- per-class scores for each spatial location.
- Box regression offsets --- encoded via Distribution Focal Loss (DFL) for precise localization.
Imports
Backbone building blocks (from Notebook 2)
The following cells re-define the backbone building blocks introduced in Notebook 2. They are reproduced here in compact form so that this notebook is fully self-contained. Refer to Notebook 2 for detailed explanations of each module.

FPN: Top-Down Path
The Feature Pyramid Network (FPN) implements a top-down pathway that enriches lower-resolution, semantically strong features with higher-resolution spatial information. The process works as follows:
- P5 is upsampled (nearest-neighbor interpolation) and concatenated with P4. A C3k2 block fuses the concatenated features.
- The fused P4 is upsampled and concatenated with P3. Another C3k2 block produces the final FPN P3 output.
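The two steps above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the notebook's actual implementation: a 1x1 convolution stands in for the C3k2 fusion block, and the channel widths (256/512/1024 for P3/P4/P5) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDownSketch(nn.Module):
    """Minimal FPN top-down sketch. A 1x1 conv stands in for the
    C3k2 fusion block used in the real neck (assumption for brevity)."""
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.fuse_p4 = nn.Conv2d(c5 + c4, c4, 1)  # stand-in for C3k2
        self.fuse_p3 = nn.Conv2d(c4 + c3, c3, 1)  # stand-in for C3k2

    def forward(self, p3, p4, p5):
        # Upsample P5, concatenate with P4, fuse.
        p4 = self.fuse_p4(torch.cat(
            [F.interpolate(p5, scale_factor=2, mode="nearest"), p4], dim=1))
        # Upsample the fused P4, concatenate with P3, fuse.
        p3 = self.fuse_p3(torch.cat(
            [F.interpolate(p4, scale_factor=2, mode="nearest"), p3], dim=1))
        return p3, p4, p5

p3 = torch.randn(1, 256, 80, 80)
p4 = torch.randn(1, 512, 40, 40)
p5 = torch.randn(1, 1024, 20, 20)
o3, o4, o5 = FPNTopDownSketch()(p3, p4, p5)
print(o3.shape, o4.shape, o5.shape)  # spatial sizes are preserved per scale
```

Note that channel counts and spatial sizes are unchanged per scale; only the content of P3 and P4 is enriched with upsampled semantics.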
PAN: Bottom-Up Path
The Path Aggregation Network (PAN) complements the FPN with a bottom-up pathway. While FPN carries semantic information downward, PAN carries strong localization features back upward:
- FPN P3 is downsampled (stride-2 convolution) and concatenated with FPN P4. A C3k2 block fuses them.
- The fused P4 is downsampled and concatenated with P5. Another C3k2 block produces the final PAN P5 output.
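The bottom-up path mirrors the FPN sketch with stride-2 convolutions in place of upsampling. Again this is only an illustration: 1x1 convs stand in for C3k2 blocks, and the channel widths are assumed.

```python
import torch
import torch.nn as nn

class PANBottomUpSketch(nn.Module):
    """Minimal PAN bottom-up sketch over FPN outputs. 1x1 convs stand
    in for the C3k2 fusion blocks (assumption for brevity)."""
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.down3 = nn.Conv2d(c3, c3, 3, stride=2, padding=1)  # stride-2 downsample
        self.fuse_p4 = nn.Conv2d(c3 + c4, c4, 1)                # stand-in for C3k2
        self.down4 = nn.Conv2d(c4, c4, 3, stride=2, padding=1)
        self.fuse_p5 = nn.Conv2d(c4 + c5, c5, 1)

    def forward(self, p3, p4, p5):
        # Downsample P3, concatenate with P4, fuse.
        p4 = self.fuse_p4(torch.cat([self.down3(p3), p4], dim=1))
        # Downsample the fused P4, concatenate with P5, fuse.
        p5 = self.fuse_p5(torch.cat([self.down4(p4), p5], dim=1))
        return p3, p4, p5

p3 = torch.randn(1, 256, 80, 80)
p4 = torch.randn(1, 512, 40, 40)
p5 = torch.randn(1, 1024, 20, 20)
o3, o4, o5 = PANBottomUpSketch()(p3, p4, p5)
print(o4.shape, o5.shape)  # fused P4 and P5 keep their original shapes
```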
C2PSA: Channel Attention Block
Partial Self-Attention (PSA) is a lightweight attention mechanism introduced in YOLOv11 that selectively emphasizes the most informative channels in a feature map while suppressing noise. The C2PSA block follows the CSP (Cross Stage Partial) pattern:
- Split the input channels into two halves.
- Process one half through bottleneck layers with channel attention (squeeze-excitation style).
- Concatenate the unprocessed half with the attended output.
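The split-attend-concatenate pattern can be sketched as follows. To keep the example short, a squeeze-excitation gate stands in for the full partial self-attention blocks; the real C2PSA is more involved.

```python
import torch
import torch.nn as nn

class C2PSASketch(nn.Module):
    """CSP-style sketch: attend to one channel half, pass the other
    through untouched. SE-style gating is a simplification of the
    actual PSA blocks (assumption for brevity)."""
    def __init__(self, c):
        super().__init__()
        h = c // 2
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),          # squeeze: global average pool
            nn.Conv2d(h, h // 4, 1), nn.ReLU(),
            nn.Conv2d(h // 4, h, 1), nn.Sigmoid(),  # excite: per-channel gate
        )

    def forward(self, x):
        a, b = x.chunk(2, dim=1)              # split channels into two halves
        return torch.cat([a, b * self.attn(b)], dim=1)  # attend one, keep one

x = torch.randn(1, 64, 20, 20)
y = C2PSASketch(64)(x)
print(y.shape)  # same shape as the input
```

Because half the channels bypass the attention path, the block adds very little compute while still letting the network reweight the most informative channels.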
DFL: Distribution Focal Loss Head
Traditional object detectors regress four continuous values (e.g., center offsets and width/height) for each bounding box. Distribution Focal Loss (DFL) takes a fundamentally different approach: instead of predicting a single scalar per box boundary, the network predicts a discrete probability distribution over a set of reg_max bins.
Why distributions instead of scalars?
- Ambiguity modeling: Object boundaries are often ambiguous (occlusion, blur). A distribution naturally represents this uncertainty.
- Better optimization: The softmax-based formulation provides smoother gradients than direct regression.
- Improved small-object accuracy: The expected-value computation gives sub-bin precision.
How it works
For each of the 4 box boundaries (left, top, right, bottom):
- The network outputs reg_max logits.
- A softmax converts them to a probability distribution $p_0, \dots, p_{R-1}$, where $R$ denotes reg_max.
- The final offset is the expected value: $\hat{d} = \sum_{i=0}^{R-1} i \cdot p_i$.
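The decoding step above is just a softmax followed by an expectation. A small self-contained sketch (reg_max = 16 is the usual default, assumed here):

```python
import torch

def dfl_decode(logits, reg_max=16):
    """Decode DFL distributions to offsets.

    logits: (N, 4, reg_max) raw per-boundary logits.
    Returns (N, 4) expected offsets, in bin units.
    """
    probs = logits.softmax(dim=-1)                       # distribution per boundary
    bins = torch.arange(reg_max, dtype=probs.dtype)      # bin centers 0..reg_max-1
    return (probs * bins).sum(dim=-1)                    # expected value

# A distribution peaked equally on bins 3 and 4 decodes to a
# sub-bin offset of about 3.5 -- the precision DFL is after.
logits = torch.full((1, 4, 16), -10.0)
logits[..., 3] = 5.0
logits[..., 4] = 5.0
print(dfl_decode(logits))  # ~3.5 for each of the four boundaries
```

The expectation is what gives sub-bin precision: even though the bins are integers, the decoded offset is continuous.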
Decoupled Detection Head
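Before the full implementation, the idea can be sketched as two independent convolutional branches sharing one input feature map. The layer widths, class count, and reg_max below are illustrative assumptions, and plain conv stacks stand in for the actual head layers.

```python
import torch
import torch.nn as nn

class DecoupledHeadSketch(nn.Module):
    """Sketch of a decoupled head for one scale: classification and
    box regression run in separate branches (widths are assumptions)."""
    def __init__(self, c_in, nc=80, reg_max=16):
        super().__init__()
        self.cls_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, nc, 1),              # per-class logits
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_in, 4 * reg_max, 1),     # DFL bins for 4 boundaries
        )

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)

head = DecoupledHeadSketch(256)
cls_out, reg_out = head(torch.randn(1, 256, 80, 80))
print(cls_out.shape, reg_out.shape)  # (1, 80, 80, 80) and (1, 64, 80, 80)
```

Because the branches share no weights after the input, classification can learn translation-invariant category features while regression learns boundary-sensitive ones.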
Full YOLOv11 Model Assembly
We now assemble the complete YOLOv11 model by combining backbone, FPN + PAN neck, C2PSA attention, and decoupled detection heads into a single end-to-end module.

Shape Verification
Let us verify that the full model produces outputs of the expected shapes at each scale level.

Parameter Count
Architecture Visualization
The following diagram illustrates the complete information flow through the YOLOv11 architecture: backbone feature extraction, FPN top-down fusion, PAN bottom-up fusion, and the per-scale detection heads.
Summary
In this notebook we built the complete YOLOv11 architecture on top of the backbone from Notebook 2:
- FPN (top-down) propagates high-level semantics from P5 down to P3, ensuring that every scale level understands what objects are present.
- PAN (bottom-up) propagates strong localization features from P3 back up to P5, ensuring that every scale level knows where objects are.
- C2PSA applies lightweight channel attention to P5, allowing the network to selectively emphasize the most informative features.
- DFL (Distribution Focal Loss) replaces direct box regression with a discrete distribution over offsets, enabling more precise localization---especially for small objects.
- Decoupled detection heads separate classification and regression into independent branches, allowing each task to specialize without interfering with the other.

