
The core dimensional constraint

Let $x \in \mathbb{R}^{B \times C_{in} \times H \times W}$. A residual unit computes $y = F(x) + \mathcal{S}(x)$, and addition requires identical tensor shapes: $F(x), \mathcal{S}(x) \in \mathbb{R}^{B \times C_{out} \times H' \times W'}$. Hence the skip connection must handle two mismatches:
  • channel mismatch: $C_{in} \neq C_{out}$
  • spatial mismatch: $(H, W) \neq (H', W')$ (typically caused by stride-2 downsampling)
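The constraint is easy to verify directly. A quick sanity check (illustrative only, using dummy tensors): adding tensors with mismatched channel or spatial dimensions fails, while matching shapes add cleanly.

```python
import torch

a = torch.randn(2, 64, 56, 56)
b = torch.randn(2, 128, 28, 28)  # channel and spatial mismatch

# Mismatched shapes: element-wise addition raises an error.
try:
    _ = a + b
except RuntimeError as e:
    print("addition failed:", e)

# Matching shapes: addition is well defined.
c = torch.randn(2, 64, 56, 56)
print("sum shape:", tuple((a + c).shape))
```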

Residual block — skip connection options

Addition requires identical tensor shapes: both the residual branch and the skip connection must produce $[B, C_{out}, H', W']$.

ResNet-style block with correct skip connection dimensioning

We implement a standard BasicBlock with:
  • residual branch: 3×3 conv → BN → ReLU → 3×3 conv → BN
  • skip connection:
    • identity if stride=1 and $C_{in} = C_{out}$
    • otherwise a 1×1 conv (projection), with the same stride as the residual branch’s downsampling
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, cin: int, cout: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(cin, cout, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(cout)
        self.conv2 = nn.Conv2d(cout, cout, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(cout)
        self.relu  = nn.ReLU(inplace=True)

        if stride != 1 or cin != cout:
            # Projection skip connection: matches channels and spatial size.
            self.skip_connection = nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(cout),
            )
        else:
            self.skip_connection = nn.Identity()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.skip_connection(x)
        out = self.relu(out)
        return out

def report(name: str, t) -> None:
    """Print tensor name, shape, dtype, and device."""
    print(f"{name}: shape={tuple(t.shape)}  dtype={t.dtype}  device={t.device}")
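To isolate just the skip path's shape handling, here is a minimal standalone check of the two cases (identity vs. 1×1 projection), using the same 64→128, stride-2 transition that appears in the backbone below. The tensors are dummy inputs for illustration.

```python
import torch
import torch.nn as nn

x = torch.randn(2, 64, 56, 56)

# Case 1: stride=1 and cin == cout -- the identity suffices.
identity = nn.Identity()

# Case 2: channel and/or spatial mismatch -- a 1x1 projection with the
# same stride as the residual branch fixes both mismatches at once.
projection = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)

print("identity:  ", tuple(identity(x).shape))    # (2, 64, 56, 56)
print("projection:", tuple(projection(x).shape))  # (2, 128, 28, 28)
```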

Option A vs. Option B (ResNet paper terminology)

In the ResNet paper’s discussion:
  • Option A: downsample the skip connection (stride 2) and zero-pad channels to match $C_{out}$.
  • Option B: downsample and project with 1×1 conv to match dimensions.
For FPN-style backbones, Option B is the preferred practical choice because:
  • the feature hierarchy is consumed downstream (e.g., lateral merges), so having a learned projection at stage transitions is robust,
  • and it matches the canonical ResNet-{50,101,152} “option B” design in the CVPR paper.
Below is a small functional illustration of “Option A-like” padding for the channel mismatch (spatial downsample uses strided slicing for simplicity).
def option_a_skip_connection(x, cout: int, stride: int):
    # Spatial downsample: emulate stride-2 skip connection by subsampling.
    if stride == 2:
        x_ds = x[:, :, ::2, ::2]
    elif stride == 1:
        x_ds = x
    else:
        raise ValueError("This demo only supports stride 1 or 2.")
    cin = x_ds.shape[1]
    if cin > cout:
        raise ValueError("Option A padding demo expects cin <= cout.")
    if cin == cout:
        return x_ds
    pad_c = cout - cin
    # Pad channels: (N,C,H,W). We pad on the channel dimension by concatenating zeros.
    zeros = torch.zeros(x_ds.shape[0], pad_c, x_ds.shape[2], x_ds.shape[3], device=x_ds.device, dtype=x_ds.dtype)
    return torch.cat([x_ds, zeros], dim=1)

# Demonstrate option A-like skip connection shape matching
x = torch.randn(2, 64, 56, 56)
Sx_a = option_a_skip_connection(x, cout=128, stride=2)
report("Option-A-like S(x)", Sx_a)
Option-A-like S(x): shape=(2, 128, 28, 28)  dtype=torch.float32  device=cpu
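One practical difference worth making explicit: Option A is parameter-free, while Option B's projection adds learnable weights. A quick comparison for the same 64→128, stride-2 transition (illustrative sketch, not from the original cells):

```python
import torch.nn as nn

# Option B projection for a 64 -> 128, stride-2 stage transition.
option_b = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(128),
)
n_params = sum(p.numel() for p in option_b.parameters())
print("Option A parameters:", 0)         # zero-padding is parameter-free
print("Option B parameters:", n_params)  # 64*128 conv weights + 2*128 BN affine
```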

A minimal ResNet-like backbone that exposes {C2, C3, C4, C5}

FPN (Lin et al.) uses the outputs of each ResNet stage’s last block: {C2, C3, C4, C5} with strides {4, 8, 16, 32} relative to the input. We build a small backbone that mirrors this structure (conceptually like a tiny ResNet-18).

Backbone stage layout — strides and channel widths

Each stage transition uses a stride-2 first block with a 1×1 projection skip connection (Option B) to match dimensions.
class TinyResNetBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Stem (like ResNet): stride-2 conv + stride-2 maxpool => output stride 4
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Stages: produce C2..C5
        self.layer1 = nn.Sequential(BasicBlock(64,  64, stride=1), BasicBlock(64,  64, stride=1))  # C2, stride 4
        self.layer2 = nn.Sequential(BasicBlock(64, 128, stride=2), BasicBlock(128, 128, stride=1)) # C3, stride 8
        self.layer3 = nn.Sequential(BasicBlock(128,256, stride=2), BasicBlock(256, 256, stride=1)) # C4, stride 16
        self.layer4 = nn.Sequential(BasicBlock(256,512, stride=2), BasicBlock(512, 512, stride=1)) # C5, stride 32

    def forward(self, x):
        x = self.stem(x)
        c2 = self.layer1(x)
        c3 = self.layer2(c2)
        c4 = self.layer3(c3)
        c5 = self.layer4(c4)
        return {"C2": c2, "C3": c3, "C4": c4, "C5": c5}

backbone = TinyResNetBackbone()
x = torch.randn(1, 3, 224, 224)
C = backbone(x)
for k in ["C2","C3","C4","C5"]:
    report(k, C[k])
C2: shape=(1, 64, 56, 56)  dtype=torch.float32  device=cpu
C3: shape=(1, 128, 28, 28)  dtype=torch.float32  device=cpu
C4: shape=(1, 256, 14, 14)  dtype=torch.float32  device=cpu
C5: shape=(1, 512, 7, 7)  dtype=torch.float32  device=cpu
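From the printed shapes, the stride of each stage relative to the 224×224 input can be read off directly; a small arithmetic check confirms the {4, 8, 16, 32} ladder:

```python
# Strides implied by the reported feature-map sizes for a 224x224 input.
input_hw = 224
for name, hw in [("C2", 56), ("C3", 28), ("C4", 14), ("C5", 7)]:
    print(f"{name}: {input_hw}/{hw} = stride {input_hw // hw}")
```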
import matplotlib.pyplot as plt
import numpy as np

stages = ['C2\n(stride 4)', 'C3\n(stride 8)', 'C4\n(stride 16)', 'C5\n(stride 32)']
channels_bb = [64, 128, 256, 512]
spatial_bb  = [56, 28, 14, 7]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
colors = ['#4e79a7', '#f28e2b', '#e15759', '#76b7b2']

bars1 = ax1.bar(stages, channels_bb, color=colors)
ax1.set_ylabel('Channels')
ax1.set_title('Channel width per backbone stage')
for b, v in zip(bars1, channels_bb):
    ax1.text(b.get_x() + b.get_width()/2, b.get_height() + 4, str(v),
             ha='center', fontweight='bold')

bars2 = ax2.bar(stages, spatial_bb, color=colors)
ax2.set_ylabel('Spatial size (H = W, pixels)')
ax2.set_title('Feature map spatial size per backbone stage\n(input 224×224)')
for b, v in zip(bars2, spatial_bb):
    ax2.text(b.get_x() + b.get_width()/2, b.get_height() + 0.4, f'{v}×{v}',
             ha='center', fontweight='bold')

plt.tight_layout()
plt.savefig('backbone_dimensions.png', dpi=120, bbox_inches='tight')
plt.show()
[Figure: bar charts of channel width (64 to 512) and spatial size (56×56 down to 7×7) per backbone stage; saved as backbone_dimensions.png]

FPN module implementation

Canonical FPN design choices (as in Lin et al.):
  • 1×1 lateral conv to unify channels to $d = 256$
  • top-down upsample by factor 2 (nearest neighbor is typical)
  • element-wise addition (requires same $H \times W$ and same $d$)
  • 3×3 conv “smoothing” on each merged map
  • optional P6 via stride-2 3×3 conv on P5 (common in detection systems)
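Before the full module, here is the core top-down merge step in isolation: a nearest-neighbor 2× upsample restores spatial alignment so that the lateral addition is valid. This standalone sketch uses dummy tensors with the shapes produced by the backbone above.

```python
import torch
import torch.nn.functional as F

m5   = torch.randn(1, 256, 7, 7)    # coarser pyramid level (already at d=256)
lat4 = torch.randn(1, 256, 14, 14)  # lateral 1x1 conv output at the finer level

# Nearest-neighbor upsample by 2 makes the spatial sizes match.
up = F.interpolate(m5, scale_factor=2.0, mode="nearest")
m4 = lat4 + up  # element-wise addition is now well defined
print("merged:", tuple(m4.shape))   # (1, 256, 14, 14)
```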

FPN top-down pathway — lateral merges and channel unification

The 1×1 lateral convolutions unify the heterogeneous backbone channels (64/128/256/512) to a uniform $d = 256$ before the element-wise additions. The additions require strict spatial and channel alignment, which the lateral convolutions and the 2× upsampling guarantee.
class FPN(nn.Module):
    def __init__(self, c2: int, c3: int, c4: int, c5: int, d: int = 256, make_p6: bool = True):
        super().__init__()
        # Lateral 1×1 convs: Ck -> d
        self.lat2 = nn.Conv2d(c2, d, kernel_size=1)
        self.lat3 = nn.Conv2d(c3, d, kernel_size=1)
        self.lat4 = nn.Conv2d(c4, d, kernel_size=1)
        self.lat5 = nn.Conv2d(c5, d, kernel_size=1)

        # Smoothing 3×3 convs on each pyramid level
        self.smooth2 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth3 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth4 = nn.Conv2d(d, d, kernel_size=3, padding=1)
        self.smooth5 = nn.Conv2d(d, d, kernel_size=3, padding=1)

        self.make_p6 = make_p6
        self.p6 = nn.Conv2d(d, d, kernel_size=3, stride=2, padding=1) if make_p6 else None

    def forward(self, C):
        c2, c3, c4, c5 = C["C2"], C["C3"], C["C4"], C["C5"]

        m5 = self.lat5(c5)
        m4 = self.lat4(c4) + F.interpolate(m5, scale_factor=2.0, mode="nearest")
        m3 = self.lat3(c3) + F.interpolate(m4, scale_factor=2.0, mode="nearest")
        m2 = self.lat2(c2) + F.interpolate(m3, scale_factor=2.0, mode="nearest")

        p5 = self.smooth5(m5)
        p4 = self.smooth4(m4)
        p3 = self.smooth3(m3)
        p2 = self.smooth2(m2)

        out = {"P2": p2, "P3": p3, "P4": p4, "P5": p5}
        if self.make_p6:
            out["P6"] = self.p6(p5)
        return out

fpn = FPN(c2=64, c3=128, c4=256, c5=512, d=256, make_p6=True)

P = fpn(C)
for k in ["P2","P3","P4","P5","P6"]:
    report(k, P[k])
P2: shape=(1, 256, 56, 56)  dtype=torch.float32  device=cpu
P3: shape=(1, 256, 28, 28)  dtype=torch.float32  device=cpu
P4: shape=(1, 256, 14, 14)  dtype=torch.float32  device=cpu
P5: shape=(1, 256, 7, 7)  dtype=torch.float32  device=cpu
P6: shape=(1, 256, 4, 4)  dtype=torch.float32  device=cpu
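Note why P6 is 4×4 rather than an exact halving of 7×7: the stride-2 3×3 conv with padding 1 gives floor((7 + 2·1 − 3) / 2) + 1 = 4. A standalone check of that arithmetic:

```python
import torch
import torch.nn as nn

p5 = torch.randn(1, 256, 7, 7)
p6_conv = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

# Output size: floor((7 + 2*1 - 3) / 2) + 1 = 4
p6 = p6_conv(p5)
print("P6:", tuple(p6.shape))  # (1, 256, 4, 4)
```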

What “preferred approach for FPN” means (operationally)

In a modern featurizer intended for FPN-style consumption, the pragmatic default is:
  1. Backbone (ResNet-style):
    • Identity skip connection if $(C_{in}, H, W)$ already matches $(C_{out}, H', W')$
    • 1×1 projection skip connection (with stride=2 when downsampling) otherwise
      This matches the ResNet paper’s “projection to match dimensions” guidance and the widespread “option B” practice in deep variants.
  2. FPN neck:
    • 1×1 lateral convs to unify all of C2..C5 to $d = 256$ channels
    • top-down nearest-neighbor upsample by 2
    • elementwise addition
    • 3×3 smoothing conv
    • optional P6 from P5 via stride-2 3×3 conv
The key theme is the same in both ResNet and FPN: addition enforces strict shape equality, so dimensioning is not a detail—it is the design constraint.

References (primary sources)

  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385.
  • Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027.
  • Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie. Feature Pyramid Networks for Object Detection. CVPR 2017. arXiv:1612.03144.