
The core dimensional constraint

Let $x \in \mathbb{R}^{B \times C_{in} \times H \times W}$. A residual unit computes $y = F(x) + \mathcal{S}(x)$, and addition requires identical tensor shapes: $F(x), \mathcal{S}(x) \in \mathbb{R}^{B \times C_{out} \times H' \times W'}$. Hence the skip connection must handle two mismatches:
  • channel mismatch: $C_{in} \neq C_{out}$
  • spatial mismatch: $(H, W) \neq (H', W')$ (typically caused by stride-2 downsampling)
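The constraint is easy to verify directly. A quick sanity check (illustrative only, using dummy tensors): adding tensors with mismatched channel or spatial dimensions fails, while matching shapes add cleanly.

```python
import torch

a = torch.randn(2, 64, 56, 56)
b = torch.randn(2, 128, 28, 28)  # channel and spatial mismatch

# Mismatched shapes: element-wise addition raises an error.
try:
    _ = a + b
except RuntimeError as e:
    print("addition failed:", e)

# Matching shapes: addition is well defined.
c = torch.randn(2, 64, 56, 56)
print("sum shape:", tuple((a + c).shape))
```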

Residual block — skip connection options

Addition requires identical tensor shapes: both the residual branch and the skip connection must produce $[B, C_{out}, H', W']$.

ResNet-style block with correct skip connection dimensioning

We implement a standard BasicBlock with:
  • residual branch: 3×3 conv → BN → ReLU → 3×3 conv → BN
  • skip connection:
    • identity if stride=1 and $C_{in} = C_{out}$
    • otherwise a 1×1 conv (projection), with the same stride as the residual branch’s downsampling
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, cin: int, cout: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(cout)
        self.relu  = nn.ReLU(inplace=True)

        if stride != 1 or cin != cout:
            # Projection skip connection: matches channels and spatial size.
            self.skip_connection = nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(cout),
            )
        else:
            self.skip_connection = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.skip_connection(x)
        out = self.relu(out)
        return out

def report(name: str, t) -> None:
    """Print tensor name, shape, dtype, and device."""
    print(f"{name}: shape={tuple(t.shape)}  dtype={t.dtype}  device={t.device}")
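To isolate just the skip path's shape handling, here is a minimal standalone check of the two cases (identity vs. 1×1 projection), using the same 64→128, stride-2 transition that appears in the backbone below. The tensors are dummy inputs for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 56, 56)

# Case 1: stride=1 and cin == cout -- the identity suffices.
identity = nn.Identity()

# Case 2: channel and/or spatial mismatch -- a 1x1 projection with the
# same stride as the residual branch fixes both mismatches at once.
projection = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

print("identity:  ", tuple(identity(x).shape))    # (2, 64, 56, 56)
print("projection:", tuple(projection(x).shape))  # (2, 128, 28, 28)
```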

Option A vs. Option B (ResNet paper terminology)

In the ResNet paper’s discussion:
  • Option A: downsample the skip connection (stride 2) and zero-pad channels to match $C_{out}$.
  • Option B: downsample and project with 1×1 conv to match dimensions.
For FPN-style backbones, Option B is the preferred practical choice because:
  • the feature hierarchy is consumed downstream (e.g., lateral merges), so having a learned projection at stage transitions is robust,
  • and it matches the canonical ResNet-{50,101,152} “option B” design in the CVPR paper.
Below is a small functional illustration of “Option A-like” padding for the channel mismatch (spatial downsample uses strided slicing for simplicity).
def option_a_skip_connection(x, cout: int, stride: int):
    # Spatial downsample: emulate stride-2 skip connection by subsampling.
    if stride == 2:
        x_ds = x[:, :, ::2, ::2]
    elif stride == 1:
        x_ds = x
    else:
        raise ValueError("This demo only supports stride 1 or 2.")
    cin = x_ds.shape[1]
    if cin > cout:
        raise ValueError("Option A padding demo expects cin <= cout.")
    if cin == cout:
        return x_ds
    pad_c = cout - cin
    # Pad channels: (N,C,H,W). We pad on the channel dimension by concatenating zeros.
    zeros = torch.zeros(x_ds.shape[0], pad_c, x_ds.shape[2], x_ds.shape[3], device=x_ds.device, dtype=x_ds.dtype)
    return torch.cat([x_ds, zeros], dim=1)

# Demonstrate option A-like skip connection shape matching
x = torch.randn(2, 64, 56, 56)
Sx_a = option_a_skip_connection(x, cout=128, stride=2)
report("Option-A-like S(x)", Sx_a)
Option-A-like S(x): shape=(2, 128, 28, 28)  dtype=torch.float32  device=cpu
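One practical difference worth making explicit: Option A is parameter-free, while Option B's projection adds learnable weights. A quick comparison for the same 64→128, stride-2 transition (illustrative sketch, not from the original cells):

```python
import torch.nn as nn

# Option B projection for a 64 -> 128, stride-2 stage transition.
option_b = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
n_params = sum(p.numel() for p in option_b.parameters())
print("Option A parameters:", 0)         # zero-padding is parameter-free
print("Option B parameters:", n_params)  # 64*128 conv weights + 2*128 BN affine
```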

A minimal ResNet-like backbone that exposes {C2, C3, C4, C5}

FPN (Lin et al.) uses the outputs of each ResNet stage’s last block: {C2, C3, C4, C5} with strides {4, 8, 16, 32} relative to the input. We build a small backbone that mirrors this structure (conceptually like a tiny ResNet-18).

Backbone stage layout — strides and channel widths

Each stage transition uses a stride-2 first block with a 1×1 projection skip connection (Option B) to match dimensions.
class TinyResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Stem (like ResNet): stride-2 conv + stride-2 maxpool => output stride 4
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Stages: produce C2..C5
        self.layer1 = nn.Sequential(BasicBlock(64,  64, stride=1), BasicBlock(64,  64, stride=1))  # C2, stride 4
        self.layer2 = nn.Sequential(BasicBlock(64, 128, stride=2), BasicBlock(128, 128, stride=1)) # C3, stride 8
        self.layer3 = nn.Sequential(BasicBlock(128,256, stride=2), BasicBlock(256, 256, stride=1)) # C4, stride 16
        self.layer4 = nn.Sequential(BasicBlock(256,512, stride=2), BasicBlock(512, 512, stride=1)) # C5, stride 32

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        return {"C2": c2, "C3": c3, "C4": c4, "C5": c5}

backbone = TinyResNetBackbone()
x = torch.randn(1, 3, 224, 224)
C = backbone(x)
for k in ["C2","C3","C4","C5"]:
    report(k, C[k])
C2: shape=(1, 64, 56, 56)  dtype=torch.float32  device=cpu
C3: shape=(1, 128, 28, 28)  dtype=torch.float32  device=cpu
C4: shape=(1, 256, 14, 14)  dtype=torch.float32  device=cpu
C5: shape=(1, 512, 7, 7)  dtype=torch.float32  device=cpu
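From the printed shapes, the stride of each stage relative to the 224×224 input can be read off directly; a small arithmetic check confirms the {4, 8, 16, 32} ladder:

```python
# Strides implied by the reported feature-map sizes for a 224x224 input.
input_hw = 224
for name, hw in [("C2", 56), ("C3", 28), ("C4", 14), ("C5", 7)]:
    print(f"{name}: {input_hw}/{hw} = stride {input_hw // hw}")
```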
import matplotlib.pyplot as plt
import numpy as np

stages = ['C2\n(stride 4)', 'C3\n(stride 8)', 'C4\n(stride 16)', 'C5\n(stride 32)']
channels_bb = [64, 128, 256, 512]
spatial_bb  = [56, 28, 14, 7]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
colors = ['#4e79a7', '#f28e2b', '#e15759', '#76b7b2']

bars1 = ax1.bar(stages, channels_bb, color=colors)
ax1.set_ylabel('Channels')
ax1.set_title('Channel width per backbone stage')
for b, v in zip(bars1, channels_bb):
    ax1.text(b.get_x() + b.get_width()/2, b.get_height() + 4, str(v),
             ha='center', fontweight='bold')

bars2 = ax2.bar(stages, spatial_bb, color=colors)
ax2.set_ylabel('Spatial size (H = W, pixels)')
ax2.set_title('Feature map spatial size per backbone stage\n(input 224×224)')
for b, v in zip(bars2, spatial_bb):
    ax2.text(b.get_x() + b.get_width()/2, b.get_height() + 0.4, f'{v}×{v}',
             ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('backbone_dimensions.png', dpi=120, bbox_inches='tight')
plt.show()
[Figure: bar charts of channel width (64 to 512) and spatial size (56×56 down to 7×7) per backbone stage; saved as backbone_dimensions.png]

FPN module implementation

Canonical FPN design choices (as in Lin et al.):
  • 1×1 lateral conv to unify channels to $d = 256$
  • top-down upsample by factor 2 (nearest neighbor is typical)
  • element-wise addition (requires same $H \times W$ and same $d$)
  • 3×3 conv “smoothing” on each merged map
  • optional P6 via stride-2 3×3 conv on P5 (common in detection systems)
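Before the full module, here is the core top-down merge step in isolation: a nearest-neighbor 2× upsample restores spatial alignment so that the lateral addition is valid. This standalone sketch uses dummy tensors with the shapes produced by the backbone above.

```python
import torch
import torch.nn.functional as F

m5   = torch.randn(1, 256, 7, 7)    # coarser pyramid level (already at d=256)
lat4 = torch.randn(1, 256, 14, 14)  # lateral 1x1 conv output at the finer level

# Nearest-neighbor upsample by 2 makes the spatial sizes match.
up = F.interpolate(m5, scale_factor=2.0, mode="nearest")
m4 = lat4 + up  # element-wise addition is now well defined
print("merged:", tuple(m4.shape))   # (1, 256, 14, 14)
```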

FPN top-down pathway — lateral merges and channel unification

The 1×1 lateral convolutions unify the heterogeneous backbone channels (64/128/256/512) to a uniform $d = 256$ before the element-wise additions. The additions require strict spatial and channel alignment, which the lateral convolutions and the 2× upsampling guarantee.
class FPN(nn.Module):
    def __init__(self, c2: int, c3: int, c4: int, c5: int, d: int = 256, make_p6: bool = True):
        super().__init__()
        # Lateral 1×1 convs: Ck -> d
        self.lat2 = nn.Conv2d(c2, d, kernel_size=1)
        self.lat3 = nn.Conv2d(c3, d, kernel_size=1)
        self.lat4 = nn.Conv2d(c4, d, kernel_size=1)
        self.lat5 = nn.Conv2d(c5, d, kernel_size=1)

        # Smoothing 3×3 convs on each pyramid level
        self.smooth2 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth3 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth4 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth5 = nn.Conv2d(d, d, kernel_size=3, padding=1)

        self.make_p6 = make_p6
        self.p6 = nn.Conv2d(d, d, kernel_size=3, stride=2, padding=1) if make_p6 else None

    def forward(self, C):
        c2, c3, c4, c5 = C["C2"], C["C3"], C["C4"], C["C5"]

        m5 = self.lat5(c5)
        m4 = self.lat4(c4) + F.interpolate(m5, scale_factor=2.0, mode="nearest")
        m3 = self.lat3(c3) + F.interpolate(m4, scale_factor=2.0, mode="nearest")
        m2 = self.lat2(c2) + F.interpolate(m3, scale_factor=2.0, mode="nearest")

        p5 = self.smooth5(m5)
        p4 = self.smooth4(m4)
        p3 = self.smooth3(m3)
        p2 = self.smooth2(m2)

        out = {"P2": p2, "P3": p3, "P4": p4, "P5": p5}
        if self.make_p6:
            out["P6"] = self.p6(p5)
        return out

fpn = FPN(c2=64, c3=128, c4=256, c5=512, d=256, make_p6=True)

P = fpn(C)
for k in ["P2","P3","P4","P5","P6"]:
    report(k, P[k])
P2: shape=(1, 256, 56, 56)  dtype=torch.float32  device=cpu
P3: shape=(1, 256, 28, 28)  dtype=torch.float32  device=cpu
P4: shape=(1, 256, 14, 14)  dtype=torch.float32  device=cpu
P5: shape=(1, 256, 7, 7)  dtype=torch.float32  device=cpu
P6: shape=(1, 256, 4, 4)  dtype=torch.float32  device=cpu
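Note why P6 is 4×4 rather than an exact halving of 7×7: the stride-2 3×3 conv with padding 1 gives floor((7 + 2·1 − 3) / 2) + 1 = 4. A standalone check of that arithmetic:

```python
import torch
import torch.nn as nn

p5 = torch.randn(1, 256, 7, 7)
p6_conv = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

# Output size: floor((7 + 2*1 - 3) / 2) + 1 = 4
p6 = p6_conv(p5)
print("P6:", tuple(p6.shape))  # (1, 256, 4, 4)
```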

What “preferred approach for FPN” means (operationally)

In a modern featurizer intended for FPN-style consumption, the pragmatic default is:
  1. Backbone (ResNet-style):
    • Identity skip connection if $(C_{in}, H, W)$ already matches $(C_{out}, H', W')$
    • 1×1 projection skip connection (with stride=2 when downsampling) otherwise
      This matches the ResNet paper’s “projection to match dimensions” guidance and the widespread “option B” practice in deep variants.
  2. FPN neck:
    • 1×1 lateral convs to unify all of C2..C5 to $d = 256$ channels
    • top-down nearest-neighbor upsample by 2
    • elementwise addition
    • 3×3 smoothing conv
    • optional P6 from P5 via stride-2 3×3 conv
The key theme is the same in both ResNet and FPN: addition enforces strict shape equality, so dimensioning is not a detail—it is the design constraint.

References (primary sources)

  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027.
  • Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie. Feature Pyramid Networks for Object Detection. CVPR 2017. arXiv:1612.03144.