Training Faster RCNN End-to-End
Notebook 5 of 6 in the Faster RCNN from-scratch series
This notebook assembles all components (backbone + FPN, RPN, ROI head) into a
single FasterRCNN module and trains it on COCO data streamed from Hugging Face.
Scope: a short training demo (5 gradient steps) that verifies the full
forward + backward pass and saves a checkpoint for notebook 06.
Memory notes — to fit training in ~16 GB of VRAM we use:
- 400 × 400 input resolution (vs the canonical 800 × 800)
- PyTorch AMP (automatic mixed precision) — the forward pass runs in float16, while loss scaling keeps the float16 gradients from underflowing
- a frozen backbone stem + layer1/2/3 (only layer4, the FPN, the RPN, and the ROI head are trained)
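The layer-freezing scheme can be sketched as follows. This is a stand-in with plain `Conv2d` modules, not the real ResNet50 — the actual freezing happens inside `frcnn_common`'s model code — but the `requires_grad` pattern is the same:

```python
import torch.nn as nn

# Stand-in modules mimicking the backbone stages; freeze everything
# except layer4, mirroring what the notebook does to save VRAM.
backbone = nn.ModuleDict({
    'stem':   nn.Conv2d(3, 64, 7, stride=2, padding=3),
    'layer1': nn.Conv2d(64, 256, 3, padding=1),
    'layer2': nn.Conv2d(256, 512, 3, padding=1),
    'layer3': nn.Conv2d(512, 1024, 3, padding=1),
    'layer4': nn.Conv2d(1024, 2048, 3, padding=1),
})
for name, param in backbone.named_parameters():
    # Only layer4 stays trainable; stem and layer1-3 are frozen.
    param.requires_grad = name.startswith('layer4')

trainable = [n for n, p in backbone.named_parameters() if p.requires_grad]
print(trainable)  # ['layer4.weight', 'layer4.bias']
```

Frozen parameters are excluded from the optimizer below via the `if p.requires_grad` filter, so they consume no optimizer state either.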
import sys, os, pathlib
# Locate frcnn_common.py — works whether run via papermill or interactively
_nb_candidates = [
    pathlib.Path.cwd().parent,  # interactive: cwd is the notebook dir
    pathlib.Path.cwd() / 'notebooks' / 'scene-understanding' / 'object-detection' / 'faster-rcnn' / 'pytorch',  # papermill: cwd is repo root
]
for _p in _nb_candidates:
    if (_p / 'frcnn_common.py').exists():
        sys.path.insert(0, str(_p))
        break
import torch
import torch.nn as nn
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader
from frcnn_common import (
    IMG_SIZE, NUM_CLASSES, DEVICE,
    COCOStreamDataset, frcnn_collate_fn,
    FasterRCNN,
)
print(f"Device: {DEVICE}")
if DEVICE.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"VRAM total: {torch.cuda.get_device_properties(0).total_memory/1024**3:.1f} GB")
Device: cuda
GPU: NVIDIA RTX A4500 Laptop GPU
VRAM total: 15.6 GB
# ─── 1. Data pipeline (imported from frcnn_common) ─────────────────────────────
ds = COCOStreamDataset(split='train', max_samples=2)
imgs, tgts = frcnn_collate_fn(list(ds))
print(f"Batch images : {imgs.shape}")
print(f"GT boxes : {[t['boxes'].shape for t in tgts]}")
Batch images : torch.Size([2, 3, 400, 400])
GT boxes : [torch.Size([8, 4]), torch.Size([2, 4])]
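The real collate function lives in `frcnn_common`; a minimal stand-in illustrating its presumed behavior (stack the fixed-size images into one batch tensor, keep the variable-length target dicts as a plain list, since box counts differ per image):

```python
import torch

def collate_sketch(samples):
    """Stand-in for frcnn_collate_fn — the real one is in frcnn_common.
    Images share a fixed size so they can be stacked; targets cannot be
    stacked (8 boxes vs 2 boxes here) so they stay a list of dicts."""
    images = torch.stack([img for img, _ in samples])
    targets = [tgt for _, tgt in samples]
    return images, targets

fake = [
    (torch.zeros(3, 400, 400),
     {'boxes': torch.zeros(8, 4), 'labels': torch.zeros(8, dtype=torch.long)}),
    (torch.zeros(3, 400, 400),
     {'boxes': torch.zeros(2, 4), 'labels': torch.zeros(2, dtype=torch.long)}),
]
imgs, tgts = collate_sketch(fake)
print(imgs.shape, [t['boxes'].shape for t in tgts])
# torch.Size([2, 3, 400, 400]) [torch.Size([8, 4]), torch.Size([2, 4])]
```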
# ─── 2. Backbone: ResNet50 + FPN (imported from frcnn_common) ─────────────────
print("Backbone and FPN imported from frcnn_common.")
Backbone and FPN imported from frcnn_common.
# ─── 3. RPN (imported from frcnn_common) ─────────────────────────────────────
print("RPN components imported from frcnn_common.")
RPN components imported from frcnn_common.
# ─── 4. ROI Head (imported from frcnn_common) ────────────────────────────────
print("ROI head components imported from frcnn_common.")
ROI head components imported from frcnn_common.
# ─── 5. FasterRCNN module (imported from frcnn_common) ───────────────────────
# Quick forward check on CPU
model = FasterRCNN(num_classes=NUM_CLASSES)
model.train()
with torch.no_grad():
    dummy_imgs = torch.randn(1, 3, 600, 600)
    dummy_tgts = [{'boxes': torch.tensor([[50., 50., 250., 250.], [100., 100., 400., 400.]]),
                   'labels': torch.tensor([3, 7])}]
    losses_check = model(dummy_imgs, dummy_tgts)
print("Loss keys:", list(losses_check.keys()))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Parameters: {total/1e6:.1f}M total | {trainable/1e6:.1f}M trainable")
Loss keys: ['rpn_cls', 'rpn_box', 'roi_cls', 'roi_box']
Parameters: 41.8M total | 33.2M trainable
Loss terms
Faster RCNN minimizes four losses jointly:
| Loss | Component | What it optimizes |
|---|---|---|
| `rpn_cls` | RPN | Distinguish foreground anchors (IoU with GT ≥ 0.7) from background (IoU < 0.3) |
| `rpn_box` | RPN | Regress anchor → GT box for positive anchors (smooth-L1) |
| `roi_cls` | ROI head | Classify each proposal into one of the 80 COCO categories or background |
| `roi_box` | ROI head | Regress proposal → GT box for positive ROIs, class-specific (smooth-L1) |
The total loss is their unweighted sum. Early in training roi_cls dominates because the randomly initialized classifier is far from correct; rpn_box and roi_box decrease as the regressors learn offsets.
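The unweighted sum and the smooth-L1 shape can be illustrated with loss magnitudes like those from step 1 of the run below:

```python
import torch
import torch.nn.functional as F

# The four loss terms combine as a plain unweighted sum (values here
# illustrative, matching the magnitudes seen at step 1).
losses = {
    'rpn_cls': torch.tensor(0.68),
    'rpn_box': torch.tensor(0.10),
    'roi_cls': torch.tensor(4.43),
    'roi_box': torch.tensor(0.00),
}
total = sum(losses.values())
print(f"{total.item():.2f}")  # 5.21

# Both box losses use smooth-L1: quadratic for |error| < 1, linear beyond,
# so one badly regressed box cannot dominate the gradient.
pred = torch.tensor([0.5, 3.0])
gt   = torch.tensor([0.0, 0.0])
print(F.smooth_l1_loss(pred, gt, reduction='none'))  # tensor([0.1250, 2.5000])
```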
Training scale: this demo runs for only 5 gradient steps — enough to confirm that
the full forward + backward pass works and that all four losses decrease.
It is not a converged model. Real COCO training requires ~90 000 steps (~12 epochs)
on a multi-GPU machine. The checkpoint saved at the end of this notebook is used only
to verify the inference pipeline in notebook 06.
# ─── 6. Training demo (5 gradient steps) ──────────────────────────────────────
model = FasterRCNN(num_classes=NUM_CLASSES).to(DEVICE)
optimizer = torch.optim.SGD(
[p for p in model.parameters() if p.requires_grad],
lr=0.005, momentum=0.9, weight_decay=1e-4)
scaler = torch.amp.GradScaler('cuda')
TRAIN_STEPS = 5
train_ds = COCOStreamDataset(split='train', max_samples=TRAIN_STEPS)
train_dl = DataLoader(train_ds, batch_size=1, collate_fn=frcnn_collate_fn)
model.train()
history = []
for step, (images, targets) in enumerate(train_dl):
    images = images.to(DEVICE)
    targets = [{k: v.to(DEVICE) for k, v in t.items()} for t in targets]

    # ── Core 5-step training loop ──────────────────────────────────────────
    # 1. Zero gradients
    optimizer.zero_grad()
    # 2. Forward pass (AMP: mixed precision for memory efficiency)
    with torch.amp.autocast('cuda'):
        losses = model(images, targets)
        total = sum(losses.values())
    # 3. Backward pass (scaled for AMP)
    scaler.scale(total).backward()
    # 4. Gradient clipping (stability: prevents exploding gradients)
    scaler.unscale_(optimizer)
    nn.utils.clip_grad_norm_([p for p in model.parameters() if p.requires_grad],
                             max_norm=10.0)
    # 5. Optimizer step
    scaler.step(optimizer)
    scaler.update()

    info = {k: f"{v.item():.4f}" for k, v in losses.items()}
    info['total'] = f"{total.item():.4f}"
    history.append({k: float(v.item()) for k, v in {**losses, 'total': total}.items()})
    print(f"Step {step+1}/{TRAIN_STEPS} {info}")

print("\nTraining demo complete.")
Step 1/5 {'rpn_cls': '0.6837', 'rpn_box': '0.0983', 'roi_cls': '4.4301', 'roi_box': '0.0001', 'total': '5.2122'}
Step 2/5 {'rpn_cls': '0.6723', 'rpn_box': '0.1660', 'roi_cls': '4.0081', 'roi_box': '0.0000', 'total': '4.8464'}
Step 3/5 {'rpn_cls': '0.6485', 'rpn_box': '0.0649', 'roi_cls': '3.1855', 'roi_box': '0.0272', 'total': '3.9261'}
Step 4/5 {'rpn_cls': '0.6520', 'rpn_box': '0.1292', 'roi_cls': '3.1818', 'roi_box': '0.0001', 'total': '3.9631'}
Step 5/5 {'rpn_cls': '0.6011', 'rpn_box': '0.1212', 'roi_cls': '1.6939', 'roi_box': '0.0611', 'total': '2.4773'}
Training demo complete.
# ─── 7. Loss curves ────────────────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
ax = axes[0]
for k in [kk for kk in history[0] if kk != 'total']:
    ax.plot([h[k] for h in history], label=k, marker='o')
ax.set_xlabel('Step'); ax.set_ylabel('Loss')
ax.set_title('Individual Loss Components (5 steps)'); ax.legend()
axes[1].plot([h['total'] for h in history], 'r-o', linewidth=2)
axes[1].set_xlabel('Step'); axes[1].set_ylabel('Total loss')
axes[1].set_title('Total Loss (5 steps)')
plt.tight_layout()
os.makedirs('images', exist_ok=True)
plt.savefig('images/loss_curves.png', dpi=100, bbox_inches='tight')
plt.show()
The training loop above includes three techniques that are not part of the core algorithm but are essential in practice:
| Technique | Why it helps |
|---|---|
| AMP (`torch.amp.autocast`) | Runs the forward pass in float16, roughly halving activation VRAM; `GradScaler` prevents gradient underflow in the backward pass |
| Gradient clipping (`clip_grad_norm_`) | Caps the gradient norm at 10.0 — prevents loss spikes when the RPN proposes very large boxes early in training |
| Gradient checkpointing (`torch.utils.checkpoint`) | Already applied in `ResNet50.forward` for layer3/layer4 — trades recomputation for memory |
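The checkpointing pattern used inside `ResNet50.forward` looks roughly like this (a toy layer stands in for layer3/layer4): activations inside the checkpointed block are discarded during the forward pass and recomputed during backward.

```python
import torch
from torch.utils.checkpoint import checkpoint

# Toy stand-in for a checkpointed backbone stage.
layer = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(8, 64, requires_grad=True)

# checkpoint() runs layer(x) without storing intermediate activations,
# then re-runs it during backward to reconstruct them.
out = checkpoint(layer, x, use_reentrant=False)
out.sum().backward()
print(x.grad.shape)  # torch.Size([8, 64])
```

The trade-off is one extra forward pass through the checkpointed stage per backward pass, in exchange for not holding its activations in VRAM.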
For a production training run you would also add:
- LR warm-up — ramp learning rate from 0 to 0.005 over the first 500 steps before any decay
- Multi-step LR decay — drop by 0.1× at epochs 8 and 11 (standard Detectron2 schedule)
- Periodic checkpointing — save every N steps, not just at the end
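The warm-up item can be sketched with `torch.optim.lr_scheduler.LinearLR` (a full run would chain a `MultiStepLR` decay after it via `SequentialLR`; the toy model here is a placeholder):

```python
import torch

# Hypothetical warm-up sketch: ramp the LR from ~0 to 0.005 over the
# first 500 steps, then hold.
model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-3, total_iters=500)

for step in range(600):
    optimizer.step()   # (loss computation and backward omitted in this sketch)
    warmup.step()
print(optimizer.param_groups[0]['lr'])  # back to 0.005 once warm-up ends
```

Warm-up matters for detection models because the RPN's early proposals are near-random; a full learning rate at step 0 can push the ROI head into a bad basin before the proposals become meaningful.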
# ─── 8. Save checkpoint ────────────────────────────────────────────────────────
os.makedirs('checkpoints', exist_ok=True)
ckpt_path = 'checkpoints/faster_rcnn_demo.pth'
torch.save({
'model_state_dict': model.state_dict(),
'optimizer_state_dict': optimizer.state_dict(),
'steps_trained': TRAIN_STEPS,
'num_classes': NUM_CLASSES,
'final_losses': history[-1],
}, ckpt_path)
size_mb = os.path.getsize(ckpt_path) / 1024**2
print(f"Checkpoint saved → {ckpt_path} ({size_mb:.1f} MB)")
print(f"Final losses: { {k: f'{v:.4f}' for k,v in history[-1].items()} }")
Checkpoint saved → checkpoints/faster_rcnn_demo.pth (286.3 MB)
Final losses: {'rpn_cls': '0.6011', 'rpn_box': '0.1212', 'roi_cls': '1.6939', 'roi_box': '0.0611', 'total': '2.4773'}
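Notebook 06 restores this checkpoint for inference; the round trip looks roughly like the self-contained sketch below (a tiny `nn.Linear` stands in for `FasterRCNN`, and `num_classes=81` is a placeholder value, not necessarily the notebook's `NUM_CLASSES`):

```python
import os, tempfile
import torch

# Save: same dict layout as the checkpoint cell above, with a stand-in model.
model = torch.nn.Linear(4, 2)
path = os.path.join(tempfile.mkdtemp(), 'demo.pth')
torch.save({'model_state_dict': model.state_dict(),
            'num_classes': 81}, path)

# Restore: rebuild the module with the saved config, then load the weights.
ckpt = torch.load(path, map_location='cpu')
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(ckpt['model_state_dict'])
restored.eval()  # eval mode: the real FasterRCNN returns detections, not losses
print(ckpt['num_classes'])  # 81
```

Saving `num_classes` alongside the weights lets the loading side reconstruct the model without importing this notebook's constants.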
Key references: (Wightman et al., 2021; Redmon & Farhadi, 2016; Zagoruyko & Komodakis, 2016; Szegedy et al., 2016; Tan & Le, 2019)
References
- Redmon, J., Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger.
- Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A. (2016). Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.
- Tan, M., Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.
- Wightman, R., Touvron, H., Jégou, H. (2021). ResNet strikes back: An improved training procedure in timm.
- Zagoruyko, S., Komodakis, N. (2016). Wide Residual Networks.