- Extracts dimensioning-relevant excerpts from the three foundational papers — ResNet (He et al., CVPR 2016), Identity Mappings (He et al., ECCV 2016), and FPN (Lin et al., CVPR 2017).
- Demonstrates the preferred, modern featurizer pattern: a ResNet-style backbone with identity shortcuts when shapes match, 1×1 projection shortcuts when they don’t, feeding into an FPN neck that unifies all pyramid levels to d = 256 channels.
- ResNet: arXiv:1512.03385
- Identity mappings: arXiv:1603.05027
- FPN: arXiv:1612.03144
1) Dimensioning-relevant excerpts (text) and figure takeaways
A. ResNet (He et al., CVPR 2016)
Short excerpt (dimension matching via projection; note the explicit stride-2 handling):
“The projection shortcut … is used to match dimensions (done by 1×1 convolutions).” (He et al., 2016, Sec. 3.3)
“When the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.” (He et al., 2016, Sec. 3.3)
Short excerpt (options for increased dimensions):
“(A) … identity mapping, with extra zero entries padded … (B) … projection shortcut … to match dimensions” (He et al., 2016, Sec. 3.3)
Figure takeaways (do not reproduce the copyrighted figures here; consult the paper figures directly):
- ResNet Fig. 3 (residual block): illustrates the residual branch and the shortcut branch being added; addition requires exact shape match.
- ResNet Fig. 5 (bottleneck block): shows the 1×1–3×3–1×1 pattern that reduces then restores channels (e.g., 256→64→64→256); the shortcut is typically identity when input/output shapes match.
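The 256→64→64→256 bottleneck pattern from Fig. 5 can be sketched as follows. This is a minimal illustration assuming PyTorch; the class and variable names are ours, not from the paper, and BN is omitted for brevity:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore (e.g. 256->64->64->256), as in ResNet Fig. 5."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False)
        self.conv3 = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False)
        self.restore = nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.reduce(x))
        out = self.relu(self.conv3(out))
        out = self.restore(out)
        # Identity shortcut: input and output shapes match, so plain addition works.
        return self.relu(out + x)

x = torch.randn(1, 256, 14, 14)
y = Bottleneck()(x)
print(tuple(y.shape))  # (1, 256, 14, 14)
```

Note that the 1×1 convs change only the channel count; the spatial size is untouched, which is why the identity shortcut is valid here.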
B. Identity Mappings (He et al., ECCV 2016)
Short excerpt (why identity shortcuts are special in signal propagation):
“forward and backward signals can be directly propagated … when using identity mappings as the skip connections …” (He et al., 2016, Abstract)
Figure takeaway:
- The paper analyzes variants of residual units and shows that moving toward “cleaner” identity shortcuts improves optimization/propagation (conceptual motivation for using projections only when necessary).
C. FPN (Lin et al., CVPR 2017)
Short excerpt (lateral dimension reduction for addition):
“the corresponding bottom-up map (which undergoes a 1×1 convolutional layer to reduce channel dimensions) [is merged] by element-wise addition.” (Lin et al., 2017, Sec. 3)
Short excerpt (the canonical design):
“We set d = 256 … thus all extra convolutional layers have 256-channel outputs.” (Lin et al., 2017, Sec. 3)
Figure takeaways:
- FPN Fig. 3 (building block): shows upsample (×2) + 1×1 lateral conv then addition, followed by a 3×3 “smoothing” conv. The addition imposes strict spatial and channel alignment.
2) The core dimensional constraint
Let x have shape (C_in, H, W). A residual unit computes y = F(x) + shortcut(x), and the addition requires identical tensor shapes: both F(x) and shortcut(x) must have shape (C_out, H', W'). Hence the shortcut must handle two possible mismatches:
- channel mismatch: C_in ≠ C_out
- spatial mismatch: (H, W) ≠ (H', W') (typically caused by stride-2 downsampling)
Residual block — shortcut options
Addition requires identical tensor shapes: both the residual branch and the shortcut must produce (C_out, H', W').
3) ResNet-style block with correct shortcut dimensioning
We implement a standard BasicBlock with:
- residual branch: 3×3 conv → BN → ReLU → 3×3 conv → BN
- shortcut:
- identity if stride=1 and C_in = C_out
- otherwise a 1×1 conv (projection), with the same stride as the residual branch’s downsampling
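A minimal sketch of this block, assuming PyTorch (module names are ours):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """3x3 conv -> BN -> ReLU -> 3x3 conv -> BN, plus identity or projection shortcut."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        if stride == 1 and c_in == c_out:
            # Shapes already match: identity shortcut.
            self.shortcut = nn.Identity()
        else:
            # Option B: 1x1 projection, with the same stride as the residual branch.
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out),
            )

    def forward(self, x):
        out = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(out + self.shortcut(x))

x = torch.randn(1, 64, 32, 32)
y = BasicBlock(64, 128, stride=2)(x)
print(tuple(y.shape))  # (1, 128, 16, 16)
```

The projection shortcut handles both mismatches at once: the 1×1 conv fixes channels, and its stride fixes the spatial size.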
3.1) What goes wrong if you try to add mismatched tensors?
Below we show:
- a block that downsamples and increases channels (stride=2, 64→128)
- naive identity shortcut fails (shape mismatch)
- projection shortcut works
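The failure and the fix can be demonstrated directly on raw tensors (a self-contained PyTorch sketch; shapes chosen for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)

# Residual branch of a downsampling block: stride-2 3x3 conv, 64 -> 128 channels.
residual = nn.Conv2d(64, 128, 3, stride=2, padding=1, bias=False)(x)
print(tuple(residual.shape))  # (1, 128, 16, 16)

# Naive identity shortcut: (1, 64, 32, 32) + (1, 128, 16, 16) cannot be added.
try:
    _ = residual + x
except RuntimeError as e:
    print("identity shortcut fails:", type(e).__name__)

# Projection shortcut: a 1x1 conv with the same stride fixes both mismatches.
projected = nn.Conv2d(64, 128, 1, stride=2, bias=False)(x)
out = residual + projected
print(tuple(out.shape))  # (1, 128, 16, 16)
```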
3.2) Option A vs. Option B (ResNet paper terminology)
In the ResNet paper’s discussion:
- Option A: downsample the shortcut (stride 2) and zero-pad channels to match C_out.
- Option B: downsample and project with 1×1 conv to match dimensions.
In practice, Option B is generally preferred, because:
- the feature hierarchy is consumed downstream (e.g., lateral merges), so having a learned projection at stage transitions is robust,
- and it matches the canonical ResNet-{50,101,152} “option B” design in the CVPR paper.
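For completeness, the parameter-free Option A shortcut can be sketched as follows (assuming PyTorch; the function name is ours):

```python
import torch
import torch.nn.functional as F

def option_a_shortcut(x, c_out, stride=2):
    """Option A: parameter-free shortcut -- spatially subsample, then zero-pad channels."""
    x = x[:, :, ::stride, ::stride]        # stride-2 spatial subsampling
    pad = c_out - x.shape[1]               # number of extra zero channels needed
    # F.pad pads from the last dim backwards: (W_l, W_r, H_t, H_b, C_front, C_back).
    return F.pad(x, (0, 0, 0, 0, 0, pad))

x = torch.randn(1, 64, 32, 32)
s = option_a_shortcut(x, c_out=128)
print(tuple(s.shape))  # (1, 128, 16, 16)
```

The padded channels carry no signal and are never learned, which is part of why Option B is the default in deeper variants.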
4) A minimal ResNet-like backbone that exposes {C2, C3, C4, C5}
FPN (Lin et al.) uses the outputs of each ResNet stage’s last block: {C2, C3, C4, C5} with strides {4, 8, 16, 32} relative to the input. We build a small backbone that mirrors this structure (conceptually like a tiny ResNet-18).
Backbone stage layout — strides and channel widths
Each stage transition uses a stride-2 first block with a 1×1 projection shortcut (Option B) to match dimensions.
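A compact sketch of such a backbone, assuming PyTorch (class names, stage widths per the text; BN on the projection shortcut omitted for brevity):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """BasicBlock with an option-B 1x1 projection shortcut when shapes change."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
        )
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out else
                         nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))

class TinyBackbone(nn.Module):
    """Exposes {C2, C3, C4, C5} at strides {4, 8, 16, 32}, channels {64, 128, 256, 512}."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(            # overall stride 4, like conv1 + maxpool
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        self.stage2 = nn.Sequential(Block(64, 64), Block(64, 64))        # C2, stride 4
        self.stage3 = nn.Sequential(Block(64, 128, 2), Block(128, 128))  # C3, stride 8
        self.stage4 = nn.Sequential(Block(128, 256, 2), Block(256, 256)) # C4, stride 16
        self.stage5 = nn.Sequential(Block(256, 512, 2), Block(512, 512)) # C5, stride 32

    def forward(self, x):
        c2 = self.stage2(self.stem(x))
        c3 = self.stage3(c2)
        c4 = self.stage4(c3)
        c5 = self.stage5(c4)
        return c2, c3, c4, c5

x = torch.randn(1, 3, 224, 224)
c2, c3, c4, c5 = TinyBackbone()(x)
print([tuple(t.shape) for t in (c2, c3, c4, c5)])
# [(1, 64, 56, 56), (1, 128, 28, 28), (1, 256, 14, 14), (1, 512, 7, 7)]
```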
5) Preferred FPN module (top-down + lateral, with d = 256)
Canonical FPN design choices (as in Lin et al.):
- 1×1 lateral conv to unify channels to d = 256
- top-down upsample by factor 2 (nearest neighbor is typical)
- element-wise addition (requires same H×W and same C)
- 3×3 conv “smoothing” on each merged map
- optional P6 via stride-2 3×3 conv on C5 (common in detection systems)
FPN top-down pathway — lateral merges and channel unification
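A minimal sketch of the neck, assuming PyTorch and the backbone channels listed below (class name is ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPN(nn.Module):
    """Top-down FPN neck (Lin et al., Sec. 3): 1x1 laterals to d = 256, x2 nearest
    upsampling, element-wise addition, then a 3x3 smoothing conv per level."""
    def __init__(self, in_channels=(64, 128, 256, 512), d=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, d, 1) for c in in_channels)
        self.smooths = nn.ModuleList(nn.Conv2d(d, d, 3, padding=1) for _ in in_channels)

    def forward(self, feats):                      # feats = (C2, C3, C4, C5)
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: start at the coarsest level and merge downward.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], scale_factor=2, mode="nearest")
        return [s(l) for s, l in zip(self.smooths, laterals)]  # (P2, P3, P4, P5)

c2 = torch.randn(1, 64, 56, 56)
c3 = torch.randn(1, 128, 28, 28)
c4 = torch.randn(1, 256, 14, 14)
c5 = torch.randn(1, 512, 7, 7)
p2, p3, p4, p5 = FPN()((c2, c3, c4, c5))
print([tuple(p.shape) for p in (p2, p3, p4, p5)])
# [(1, 256, 56, 56), (1, 256, 28, 28), (1, 256, 14, 14), (1, 256, 7, 7)]
```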
The 1×1 lateral convolutions unify heterogeneous backbone channels (64/128/256/512) to a uniform d = 256 before the element-wise additions. The additions require strict spatial and channel alignment — which the lateral convolutions and ×2 upsample guarantee.
5.1) Sanity checks: the additions are well-defined
Each merge is of the form P_l = lateral(C_l) + upsample(P_{l+1}), so we assert shape equality at each merge point.
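These checks can be made explicit (a self-contained PyTorch sketch; the lateral outputs are stand-in random tensors at the correct FPN strides for a 224×224 input):

```python
import torch
import torch.nn.functional as F

d = 256
# Lateral outputs at strides {4, 8, 16, 32}: already d channels each.
lat = {lvl: torch.randn(1, d, 224 // s, 224 // s)
       for lvl, s in zip((2, 3, 4, 5), (4, 8, 16, 32))}

p = {5: lat[5]}
for lvl in (4, 3, 2):
    up = F.interpolate(p[lvl + 1], scale_factor=2, mode="nearest")
    # P_l = lateral(C_l) + upsample(P_{l+1}) is only defined if shapes agree exactly.
    assert up.shape == lat[lvl].shape, (up.shape, lat[lvl].shape)
    p[lvl] = lat[lvl] + up
print("all merges well-defined:", [tuple(p[l].shape) for l in (2, 3, 4, 5)])
```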
6) What “preferred approach for FPN” means (operationally)
In a modern featurizer intended for FPN-style consumption, the pragmatic default is:
Backbone (ResNet-style):
- Identity shortcut if stride=1 and C_in = C_out
- 1×1 projection shortcut (with stride=2 when downsampling) otherwise
This matches the ResNet paper’s “projection to match dimensions” guidance and the widespread “option B” practice in deep variants.
FPN neck:
- 1×1 lateral convs to unify all levels to d = 256 channels
- top-down nearest-neighbor upsample by 2
- elementwise addition
- 3×3 smoothing conv
- optional P6 from C5 via stride-2 3×3 conv
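The optional extra level can be sketched in one line (a PyTorch sketch; note that the original FPN paper instead builds P6 by stride-2 subsampling of P5 for RPN, while the stride-2 3×3 conv on C5 is the convention popularized by later detectors such as RetinaNet):

```python
import torch
import torch.nn as nn

d = 256
c5 = torch.randn(1, 512, 7, 7)
# Extra pyramid level: stride-2 3x3 conv on C5 yields a stride-64 map.
p6 = nn.Conv2d(512, d, 3, stride=2, padding=1)(c5)
print(tuple(p6.shape))  # (1, 256, 4, 4)
```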
References (primary sources)
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016. arXiv:1512.03385.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Identity Mappings in Deep Residual Networks. ECCV 2016. arXiv:1603.05027.
- Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie. Feature Pyramid Networks for Object Detection. CVPR 2017. arXiv:1612.03144.

