Faster R-CNN Hyperparameter Optimization with Optuna + W&B (COCO MiniTrain) and Small-Object Transfer (Drones)
This notebook is the deliverable for the assignment.

- Training dataset: COCO MiniTrain (COCO-format subset): https://github.com/giddyyupp/coco-minitrain
- Optimization engine: Optuna (TPE + pruning)
- Experiment tracking: Weights & Biases (W&B) (logging + dashboards)
- Generalization test: drone small-object detection (Assignment 3 dataset)

You will:
- Run a baseline Faster R-CNN on COCO MiniTrain.
- Run stage-wise hyperparameter optimization with Optuna.
- Log all runs to W&B and analyze them in the W&B UI.
- Evaluate the tuned detector on the drone dataset and discuss transfer.
What you must submit
- A shareable link to your W&B project (public or access granted to the TA).
- This notebook (executed), including:
- baseline run
- Optuna study runs with pruning
- final 3-seed retraining
- drone evaluation (baseline vs tuned)
- analysis cells (plots + written answers)
Metrics

You must report COCO-style metrics:
- mAP (COCO mAP@[0.5:0.95])
- AP50 and AP75
- Recall (COCO AR or a simpler recall estimate)
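All of these metrics come out of COCOeval's 12-element stats vector, which maps positionally to named metrics. A small helper can name them; this is a sketch (COCO_STATS_KEYS and stats_to_dict are illustrative names, assuming pycocotools' default bbox stats layout):

```python
# Index map for COCOeval.stats (bbox task, default params) — a convenience
# sketch; verify against your pycocotools version.
COCO_STATS_KEYS = [
    "mAP",         # 0: AP @ IoU=0.50:0.95
    "AP50",        # 1: AP @ IoU=0.50
    "AP75",        # 2: AP @ IoU=0.75
    "AP_small", "AP_medium", "AP_large",   # 3-5: AP by object area
    "AR_1", "AR_10", "AR_100",             # 6-8: AR at maxDets = 1, 10, 100
    "AR_small", "AR_medium", "AR_large",   # 9-11: AR by object area
]

def stats_to_dict(stats):
    """Turn COCOeval.stats (a length-12 array) into a named dict."""
    return {k: float(v) for k, v in zip(COCO_STATS_KEYS, stats)}
```

With this, the recall requirement above is simply `stats_to_dict(coco_eval.stats)["AR_100"]`.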
Objective (default)

You will optimize validation COCO mAP: max_θ mAP_val(θ), where θ denotes the hyperparameters under search. If you want to trade off latency, define a scalarized objective: J(θ) = mAP_val(θ) - λ · Latency(θ). In that case, you must define λ and measure latency consistently.
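The scalarized objective can be written as a one-line helper; scalarized_objective and its default lam value are illustrative, not part of the assignment spec:

```python
def scalarized_objective(map_val: float, latency_ms: float,
                         lam: float = 0.001) -> float:
    """J(theta) = mAP_val(theta) - lambda * Latency(theta).

    `lam` is a hypothetical trade-off weight you must choose and report;
    latency should be measured the same way for every trial (same device,
    batch size, and warm-up procedure).
    """
    return map_val - lam * latency_ms
```

Returning this value from the Optuna objective (instead of raw mAP) makes the study latency-aware without any other changes.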
0. Colab setup

- Enable GPU: Runtime → Change runtime type → GPU
- Install packages
- Login to W&B
# If you need to install packages, do it here (Colab).
# !pip -q install torch torchvision
# !pip -q install pycocotools
# !pip -q install optuna
# !pip -q install wandb
import os, json, random, time
from dataclasses import dataclass, asdict
from typing import Dict, Any, List, Tuple, Optional
import numpy as np
import torch
import torchvision
from torchvision.transforms import functional as F
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("cuda available:", torch.cuda.is_available())
torch: 2.7.1+cu128
torchvision: 0.22.1+cu128
cuda available: True
# W&B authentication
# In Docker: WANDB_API_KEY is set in the environment automatically.
# In Colab: call wandb.login() interactively.
import wandb
if os.environ.get('WANDB_API_KEY'):
wandb.login(key=os.environ['WANDB_API_KEY'])
print('W&B: authenticated via WANDB_API_KEY env var')
else:
wandb.login()
print('W&B: interactive login')
wandb: WARNING If you're specifying your api key in code, ensure this code is not shared publicly.
wandb: WARNING Consider setting the WANDB_API_KEY environment variable, or running `wandb login` from the command line.
wandb: [wandb.login()] Using explicit session credentials for https://api.wandb.ai.
wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /home/vscode/.netrc
wandb: Currently logged in as: pantelis to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
W&B: authenticated via WANDB_API_KEY env var
1. Reproducibility (required)

You must fix and log:
- random seeds
- dataset split indices
- code version (commit hash, if applicable)
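Capturing the commit hash can be done with a small helper and logged into wandb.config alongside the seed; get_git_commit is an illustrative name, assuming git is available on PATH:

```python
import subprocess

def get_git_commit() -> str:
    """Return the current commit hash, or 'unknown' outside a git repo.

    Sketch: call this once per run and store the result in the run config
    so every W&B run is traceable to the exact code that produced it.
    """
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        ).decode().strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"
```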
def set_global_seed(seed: int) -> None:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
def seed_worker(worker_id: int) -> None:
# Deterministic DataLoader workers
worker_seed = torch.initial_seed() % 2**32
np.random.seed(worker_seed)
random.seed(worker_seed)
BASE_SEED = 1337
set_global_seed(BASE_SEED)
2. Dataset: COCO MiniTrain

Clone the dataset repo and set paths below. COCO MiniTrain repository: https://github.com/giddyyupp/coco-minitrain

You will create a deterministic train/val split.

Required outputs:
- train_ids.json and val_ids.json saved to disk
- logged to W&B as artifacts (optional but encouraged)
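The split procedure used below can be factored into a pure function, which makes the determinism requirement easy to verify (same ids and seed always give the same split). deterministic_split is a hypothetical helper, not notebook API:

```python
import numpy as np

def deterministic_split(ids, val_frac: float, seed: int):
    """Deterministic train/val split driven entirely by (ids, seed).

    Uses a locally-seeded Generator so the split does not depend on
    global RNG state set elsewhere in the notebook.
    """
    ids = list(ids)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(ids))
    n_val = int(len(ids) * val_frac)
    val = [ids[i] for i in perm[:n_val]]
    train = [ids[i] for i in perm[n_val:]]
    return train, val
```

Calling it twice with the same arguments must return identical lists; that is the property worth asserting before saving the id files to disk.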
# --- Dataset: COCO MiniTrain ---
#
# Sampling methodology: https://github.com/giddyyupp/coco-minitrain
# Statistically samples N images from COCO 2017 train preserving class/size distributions.
#
# Pre-sampled subsets:
# HF repo : https://huggingface.co/datasets/bryanbocao/coco_minitrain
# Files : coco_minitrain_10k.zip (9 GB) | _15k | _20k | _25k
# Format : YOLO labels + JPEG images (no COCO JSON included)
#
# This cell downloads the 10k subset, then generates a COCO JSON annotation file
# by filtering the official COCO 2017 train annotations to the 10k image IDs.
# For the full assignment run change HF_DATASET_FILE to "coco_minitrain_25k.zip".
import os, time, zipfile, json as _json
import requests as _requests
from huggingface_hub import hf_hub_download
HF_DATASET_REPO = "bryanbocao/coco_minitrain"
HF_DATASET_FILE = "coco_minitrain_10k.zip" # change to _25k for full run
COCO_ANN_URL = "http://images.cocodataset.org/annotations/annotations_trainval2017.zip"
_IS_COLAB = os.path.isdir("/content")
DATASET_BASE = "/content" if _IS_COLAB else "/workspaces/eng-ai-agents/data"
EXTRACT_ROOT = os.path.join(DATASET_BASE, "coco_minitrain")
COCO_MINITRAIN_ROOT = None
IMAGES_DIR = None
ANN_JSON = None
DATASET_READY = False
def _hf_download_with_retry(repo_id, filename, repo_type, local_dir,
max_retries=5, base_wait=60):
for attempt in range(max_retries):
try:
return hf_hub_download(repo_id=repo_id, filename=filename,
repo_type=repo_type, local_dir=local_dir)
except Exception as e:
if "429" in str(e) and attempt < max_retries - 1:
wait = base_wait * (2 ** attempt)
print(f" Rate-limited (attempt {attempt+1}/{max_retries}). Waiting {wait}s ...")
time.sleep(wait)
else:
raise
raise RuntimeError("hf_hub_download: max retries exceeded")
def _build_coco_json(images_dir, full_ann_path, out_path):
"""Filter instances_train2017.json to the image IDs present in images_dir."""
img_files = [f for f in os.listdir(images_dir) if f.lower().endswith(".jpg")]
present_ids = {int(os.path.splitext(f)[0]) for f in img_files}
print(f" Found {len(present_ids)} images in {images_dir}")
print(f" Loading full COCO annotations from {full_ann_path} ...")
with open(full_ann_path) as f:
full = _json.load(f)
imgs = [im for im in full["images"] if im["id"] in present_ids]
anns = [an for an in full["annotations"] if an["image_id"] in present_ids]
mini = {
"info": full.get("info", {}),
"licenses": full.get("licenses", []),
"categories": full["categories"],
"images": imgs,
"annotations": anns,
}
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, "w") as f:
_json.dump(mini, f)
print(f" Wrote {len(imgs)} images / {len(anns)} annotations → {out_path}")
def _ensure_coco_full_annotations(ann_dir):
"""Download and extract official COCO 2017 train annotations if missing.
The COCO zip extracts to an 'annotations/' subdir, so the final path is
ann_dir/annotations/instances_train2017.json.
"""
target = os.path.join(ann_dir, "annotations", "instances_train2017.json")
if os.path.exists(target):
return target
os.makedirs(ann_dir, exist_ok=True)
zip_path = os.path.join(ann_dir, "annotations_trainval2017.zip")
print(f" Downloading COCO 2017 annotations (~253 MB) ...")
with _requests.get(COCO_ANN_URL, stream=True, timeout=120) as r:
r.raise_for_status()
with open(zip_path, "wb") as f:
for chunk in r.iter_content(1 << 20):
f.write(chunk)
print(" Extracting ...")
with zipfile.ZipFile(zip_path) as zf:
zf.extractall(ann_dir)
os.remove(zip_path)
return target
# ── Step 1: locate or download+extract the HF zip ────────────────────────────
extract_dir = os.path.join(EXTRACT_ROOT, HF_DATASET_FILE.replace(".zip", ""))
zip_local = os.path.join(EXTRACT_ROOT, HF_DATASET_FILE)
if os.path.isdir(extract_dir) and os.listdir(extract_dir):
print(f"Cached extraction found: {extract_dir}")
else:
os.makedirs(EXTRACT_ROOT, exist_ok=True)
try:
if os.path.exists(zip_local):
print(f"Zip already downloaded: {zip_local}")
else:
print(f"Downloading {HF_DATASET_FILE} from {HF_DATASET_REPO} ...")
zip_local = _hf_download_with_retry(
repo_id=HF_DATASET_REPO, filename=HF_DATASET_FILE,
repo_type="dataset", local_dir=EXTRACT_ROOT,
)
print(f"Download complete: {zip_local}")
print(f"Extracting to {extract_dir} ...")
os.makedirs(extract_dir, exist_ok=True)
with zipfile.ZipFile(zip_local, "r") as zf:
zf.extractall(extract_dir)
print("Extraction complete.")
except Exception as e:
print(f"HF download/extraction failed: {e}")
extract_dir = None
# ── Step 2: locate train2017 images dir ───────────────────────────────────────
if extract_dir and os.path.isdir(extract_dir):
# Zip extracts to coco_minitrain_10k/coco_minitrain_10k/images/train2017/
train_imgs = os.path.join(extract_dir, os.path.basename(extract_dir),
"images", "train2017")
if not os.path.isdir(train_imgs):
best_dir, best_n = extract_dir, 0
for dp, _, files in os.walk(extract_dir):
n = sum(1 for f in files if f.lower().endswith(".jpg"))
if n > best_n:
best_dir, best_n = dp, n
train_imgs = best_dir if best_n > 10 else None
if train_imgs:
IMAGES_DIR = train_imgs
# ── Step 3: build COCO JSON if not present ─────────────────────────
ann_dir = os.path.join(extract_dir, "annotations")
ann_json = os.path.join(ann_dir, "instances_minitrain.json")
if not os.path.exists(ann_json):
full_ann_cache = os.path.join(EXTRACT_ROOT, "coco_full_annotations")
full_ann_path = _ensure_coco_full_annotations(full_ann_cache)
_build_coco_json(IMAGES_DIR, full_ann_path, ann_json)
else:
print(f"Annotation JSON already exists: {ann_json}")
ANN_JSON = ann_json
COCO_MINITRAIN_ROOT = extract_dir
DATASET_READY = True
print(f"Dataset ready:")
print(f" Root: {COCO_MINITRAIN_ROOT}")
print(f" Annotations: {ANN_JSON}")
print(f" Images: {IMAGES_DIR}")
else:
print(f"WARNING: could not locate train2017 images inside {extract_dir}")
# ── Step 4: fallback — git clone (annotations only) ──────────────────────────
if not DATASET_READY:
print("\nFalling back to git clone of coco-minitrain (annotations only).")
for p in ["coco-minitrain", "/content/coco-minitrain",
os.path.expanduser("~/coco-minitrain")]:
if os.path.isdir(p):
COCO_MINITRAIN_ROOT = p
break
if COCO_MINITRAIN_ROOT is None:
os.system("git clone --depth 1 https://github.com/giddyyupp/coco-minitrain.git")
COCO_MINITRAIN_ROOT = "coco-minitrain"
IMAGES_DIR = os.path.join(COCO_MINITRAIN_ROOT, "images")
ANN_JSON = os.path.join(COCO_MINITRAIN_ROOT, "annotations", "instances_minitrain.json")
if os.path.exists(ANN_JSON):
DATASET_READY = True
print(f"Using local clone: {COCO_MINITRAIN_ROOT}")
else:
print("WARNING: Dataset not ready — no annotation file found.")
print(f"\nDATASET_READY : {DATASET_READY}")
if DATASET_READY:
print(f"ANN_JSON : {ANN_JSON}")
print(f"IMAGES_DIR : {IMAGES_DIR}")
Cached extraction found: /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k
Found 10000 images in /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k/coco_minitrain_10k/images/train2017
Loading full COCO annotations from /workspaces/eng-ai-agents/data/coco_minitrain/coco_full_annotations/annotations/instances_train2017.json ...
Wrote 10000 images / 72944 annotations → /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k/annotations/instances_minitrain.json
Dataset ready:
Root: /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k
Annotations: /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k/annotations/instances_minitrain.json
Images: /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k/coco_minitrain_10k/images/train2017
DATASET_READY : True
ANN_JSON : /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k/annotations/instances_minitrain.json
IMAGES_DIR : /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k/coco_minitrain_10k/images/train2017
from pycocotools.coco import COCO
from PIL import Image
coco = COCO(ANN_JSON)
img_ids = sorted(coco.getImgIds())
print("num images:", len(img_ids))
# Deterministic split
val_frac = 0.2
rng = np.random.default_rng(BASE_SEED)
perm = rng.permutation(len(img_ids))
n_val = int(len(img_ids) * val_frac)
val_ids = [img_ids[i] for i in perm[:n_val]]
train_ids = [img_ids[i] for i in perm[n_val:]]
print("train:", len(train_ids), "val:", len(val_ids))
SPLIT_DIR = os.path.join(COCO_MINITRAIN_ROOT, "splits")
os.makedirs(SPLIT_DIR, exist_ok=True)
with open(os.path.join(SPLIT_DIR, "train_ids.json"), "w") as f:
json.dump(train_ids, f)
with open(os.path.join(SPLIT_DIR, "val_ids.json"), "w") as f:
json.dump(val_ids, f)
print("Saved split ids to:", SPLIT_DIR)
loading annotations into memory...
Done (t=1.42s)
creating index...
index created!
num images: 10000
train: 8000 val: 2000
Saved split ids to: /workspaces/eng-ai-agents/data/coco_minitrain/coco_minitrain_10k/splits
3. PyTorch dataset and transforms

You must keep transforms simple initially. Use augmentations only after baseline correctness is established.

Recommended minimal transforms:
- Convert to tensor
- (Optional) resize to a fixed shorter side (be consistent across runs)
from torch.utils.data import Dataset, DataLoader
class CocoMiniTrainDataset(Dataset):
def __init__(self, coco: COCO, image_dir: str, img_ids: List[int], train: bool = True):
self.coco = coco
self.image_dir = image_dir
self.img_ids = img_ids
self.train = train
def __len__(self) -> int:
return len(self.img_ids)
def __getitem__(self, idx: int):
img_id = self.img_ids[idx]
img_info = self.coco.loadImgs([img_id])[0]
img_path = os.path.join(self.image_dir, img_info["file_name"])
image = Image.open(img_path).convert("RGB")
ann_ids = self.coco.getAnnIds(imgIds=[img_id], iscrowd=None)
anns = self.coco.loadAnns(ann_ids)
boxes = []
labels = []
areas = []
iscrowd = []
for a in anns:
# COCO bbox: [x,y,w,h] -> [x1,y1,x2,y2]
x, y, w, h = a["bbox"]
if w <= 1 or h <= 1:
continue
boxes.append([x, y, x + w, y + h])
labels.append(a["category_id"])
areas.append(a.get("area", w * h))
iscrowd.append(a.get("iscrowd", 0))
boxes = torch.as_tensor(boxes, dtype=torch.float32)
labels = torch.as_tensor(labels, dtype=torch.int64)
areas = torch.as_tensor(areas, dtype=torch.float32)
iscrowd = torch.as_tensor(iscrowd, dtype=torch.int64)
image_t = F.to_tensor(image)
target = {
"boxes": boxes,
"labels": labels,
"image_id": torch.tensor([img_id]),
"area": areas,
"iscrowd": iscrowd,
}
return image_t, target
def collate_fn(batch):
return tuple(zip(*batch))
train_ds = CocoMiniTrainDataset(coco, IMAGES_DIR, train_ids, train=True)
val_ds = CocoMiniTrainDataset(coco, IMAGES_DIR, val_ids, train=False)
print("train len:", len(train_ds), "val len:", len(val_ds))
train len: 8000 val len: 2000
BATCH_SIZE = 2 # adjust to GPU memory
NUM_WORKERS = 2
g = torch.Generator()
g.manual_seed(BASE_SEED)
train_loader = DataLoader(
train_ds, batch_size=BATCH_SIZE, shuffle=True,
num_workers=NUM_WORKERS, collate_fn=collate_fn,
worker_init_fn=seed_worker, generator=g
)
val_loader = DataLoader(
val_ds, batch_size=1, shuffle=False,
num_workers=NUM_WORKERS, collate_fn=collate_fn,
worker_init_fn=seed_worker, generator=g
)
next(iter(train_loader))[0][0].shape
torch.Size([3, 439, 640])
4. Model: Faster R-CNN (torchvision)

You will use torchvision.models.detection.fasterrcnn_resnet50_fpn.
from torchvision.models.detection import fasterrcnn_resnet50_fpn
def build_model(num_classes: Optional[int] = None):
# COCO has 80 categories (plus background internally).
# In torchvision, num_classes includes background.
# If you want to adapt to a different label space, you must remap category IDs.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
if num_classes is not None:
# Replace the box predictor head
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = torchvision.models.detection.faster_rcnn.FastRCNNPredictor(in_features, num_classes)
return model
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_model().to(device)
print("Model loaded on:", device)
Downloading: "https://download.pytorch.org/models/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth" to /home/vscode/.cache/torch/hub/checkpoints/fasterrcnn_resnet50_fpn_coco-258fb6c6.pth
100%|██████████| 160M/160M [00:03<00:00, 52.5MB/s]
Model loaded on: cuda
5. Training and evaluation

You will implement:
- a training loop that logs loss components
- COCO evaluation via pycocotools.cocoeval.COCOeval

Important notes:
- COCO category IDs are not always contiguous. Torchvision expects contiguous class indices when you replace heads.
- For this assignment you will keep the default COCO label space and use the pretrained COCO model, then fine-tune on COCO MiniTrain.
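If you do replace the head later (e.g., for the drone dataset), the contiguous remap mentioned in the first note can be built as in this sketch; build_category_remap is a hypothetical helper, assuming a pycocotools COCO object:

```python
def build_category_remap(coco):
    """Map raw COCO category ids (non-contiguous, e.g. 1..90 with gaps)
    to a contiguous 1..K index space, with 0 reserved for background.

    Returns both directions: raw->contiguous for building training
    targets, contiguous->raw for converting predictions back before
    COCOeval scoring.
    """
    cat_ids = sorted(coco.getCatIds())
    raw_to_contig = {cid: i + 1 for i, cid in enumerate(cat_ids)}
    contig_to_raw = {v: k for k, v in raw_to_contig.items()}
    return raw_to_contig, contig_to_raw
```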
from pycocotools.cocoeval import COCOeval
@torch.no_grad()
def evaluate_coco_map(model, coco_gt: COCO, data_loader: DataLoader, max_dets: int = 100):
model.eval()
results = []
for images, targets in data_loader:
images = [img.to(device) for img in images]
outputs = model(images)
for out, tgt in zip(outputs, targets):
img_id = int(tgt["image_id"].item())
boxes = out["boxes"].detach().cpu().numpy() # [N,4] x1,y1,x2,y2
scores = out["scores"].detach().cpu().numpy()
labels = out["labels"].detach().cpu().numpy()
# Convert to COCO format
for b, s, c in zip(boxes, scores, labels):
x1, y1, x2, y2 = b.tolist()
w = max(0.0, x2 - x1)
h = max(0.0, y2 - y1)
results.append({
"image_id": img_id,
"category_id": int(c),
"bbox": [x1, y1, w, h],
"score": float(s),
})
if len(results) == 0:
return {"mAP": 0.0, "AP50": 0.0, "AP75": 0.0}
coco_dt = coco_gt.loadRes(results)
coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.params.maxDets = [max_dets, max_dets, max_dets]
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()
# COCOeval.stats indices:
# 0: AP IoU=0.50:0.95
# 1: AP IoU=0.50
# 2: AP IoU=0.75
mAP = float(coco_eval.stats[0])
AP50 = float(coco_eval.stats[1])
AP75 = float(coco_eval.stats[2])
return {"mAP": mAP, "AP50": AP50, "AP75": AP75}
def train_one_epoch(model, optimizer, data_loader: DataLoader, epoch: int, max_norm: float = 0.0):
model.train()
loss_sums = {"loss": 0.0}
n = 0
for images, targets in data_loader:
images = [img.to(device) for img in images]
targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())
optimizer.zero_grad(set_to_none=True)
losses.backward()
if max_norm and max_norm > 0:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
optimizer.step()
# accumulate
n += 1
loss_sums["loss"] += float(losses.item())
for k, v in loss_dict.items():
loss_sums[k] = loss_sums.get(k, 0.0) + float(v.item())
for k in loss_sums:
loss_sums[k] /= max(1, n)
return loss_sums
6. Baseline run (required)

Run a baseline training job and log to W&B.

Required:
- train loss curves (total + components)
- validation metrics: mAP, AP50, AP75
- save the model checkpoint
os.makedirs("checkpoints", exist_ok=True)
from torch.optim import SGD
from torch.optim.lr_scheduler import StepLR
BASELINE_CFG = {
"seed": BASE_SEED,
"epochs": int(os.environ.get("BASELINE_EPOCHS", 3)),
"lr": 0.005,
"momentum": 0.9,
"weight_decay": 1e-4,
"grad_clip_norm": 0.0,
"step_size": 6,
"gamma": 0.1,
"batch_size": BATCH_SIZE,
}
set_global_seed(BASELINE_CFG["seed"])
run = wandb.init(
project="faster-rcnn-optuna-coco-minitrain",
name="baseline",
config=BASELINE_CFG
)
model = build_model().to(device)
optimizer = SGD(
model.parameters(),
lr=BASELINE_CFG["lr"],
momentum=BASELINE_CFG["momentum"],
weight_decay=BASELINE_CFG["weight_decay"]
)
scheduler = StepLR(optimizer, step_size=BASELINE_CFG["step_size"], gamma=BASELINE_CFG["gamma"])
for epoch in range(BASELINE_CFG["epochs"]):
t0 = time.time()
losses = train_one_epoch(model, optimizer, train_loader, epoch, max_norm=BASELINE_CFG["grad_clip_norm"])
scheduler.step()
metrics = evaluate_coco_map(model, coco, val_loader)
log_dict = {**losses, **{f"val_{k}": v for k, v in metrics.items()}, "epoch": epoch, "lr": scheduler.get_last_lr()[0], "epoch_time_s": time.time()-t0}
wandb.log(log_dict)
print(f"Epoch {epoch}: loss={losses['loss']:.4f} val_mAP={metrics['mAP']:.4f}")
BASELINE_CKPT = os.path.join("checkpoints", "baseline_fasterrcnn.pt")
torch.save(model.state_dict(), BASELINE_CKPT)
wandb.save(BASELINE_CKPT)
wandb.finish()
print("Saved:", BASELINE_CKPT)
wandb: setting up run mm333dqx
wandb: Tracking run with wandb version 0.25.0
wandb: Run data is saved locally in /workspaces/eng-ai-agents/wandb/run-20260305_165657-mm333dqx
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run baseline
wandb: ⭐️ View project at https://wandb.ai/pantelis/faster-rcnn-optuna-coco-minitrain
wandb: 🚀 View run at https://wandb.ai/pantelis/faster-rcnn-optuna-coco-minitrain/runs/mm333dqx
Loading and preparing results...
DONE (t=0.50s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=21.59s).
Accumulating evaluation results...
DONE (t=4.70s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.055
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.096
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.059
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.035
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.062
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.068
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.083
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.083
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.083
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.046
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.087
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.109
Epoch 0: loss=0.6911 val_mAP=0.0555
Loading and preparing results...
DONE (t=0.18s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=24.71s).
Accumulating evaluation results...
DONE (t=5.54s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.052
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.091
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.054
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.034
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.059
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.060
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.085
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.085
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.085
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.051
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.093
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.107
Epoch 1: loss=0.6072 val_mAP=0.0521
Loading and preparing results...
DONE (t=0.10s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
DONE (t=17.82s).
Accumulating evaluation results...
DONE (t=3.21s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.057
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.099
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.060
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.035
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.063
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.071
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.086
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.086
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.086
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.051
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.091
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.114
Epoch 2: loss=0.5561 val_mAP=0.0574
wandb: WARNING Symlinked 1 file into the W&B run directory; call wandb.save again to sync new files.
wandb: updating run metadata
wandb: uploading checkpoints/baseline_fasterrcnn.pt
wandb: uploading checkpoints/baseline_fasterrcnn.pt; uploading history steps 2-2, summary
wandb: uploading data
wandb:
wandb: Run history:
wandb: epoch ▁▅█
wandb: epoch_time_s █▂▁
wandb: loss █▄▁
wandb: loss_box_reg █▅▁
wandb: loss_classifier █▃▁
wandb: loss_objectness █▄▁
wandb: loss_rpn_box_reg █▄▁
wandb: lr ▁▁▁
wandb: val_AP50 ▆▁█
wandb: val_AP75 ▇▁█
wandb: +1 ...
wandb:
wandb: Run summary:
wandb: epoch 2
wandb: epoch_time_s 1398.17586
wandb: loss 0.55609
wandb: loss_box_reg 0.24568
wandb: loss_classifier 0.21955
wandb: loss_objectness 0.03035
wandb: loss_rpn_box_reg 0.06052
wandb: lr 0.005
wandb: val_AP50 0.09916
wandb: val_AP75 0.05954
wandb: +1 ...
wandb:
wandb: 🚀 View run baseline at: https://wandb.ai/pantelis/faster-rcnn-optuna-coco-minitrain/runs/mm333dqx
wandb: ⭐️ View project at: https://wandb.ai/pantelis/faster-rcnn-optuna-coco-minitrain
wandb: Synced 4 W&B file(s), 0 media file(s), 0 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20260305_165657-mm333dqx/logs
Saved: checkpoints/baseline_fasterrcnn.pt
7. Optuna + W&B: stage-wise hyperparameter optimization (required)

You will run Optuna studies in stages:
- Stage 1: optimizer dynamics (LR, weight decay, momentum, warmup)
- Stage 2: RPN hyperparameters
- Stage 3: RoI head hyperparameters
- Stage 4: post-processing calibration (no training)

Use a TPESampler and pruning (MedianPruner or HyperbandPruner). For each trial:
- Train for a small budget (e.g., 3-5 epochs),
- Report intermediate validation mAP via trial.report(...),
- Allow Optuna to prune underperforming trials.
import optuna
def make_optimizer(model, lr: float, momentum: float, weight_decay: float):
return SGD(model.parameters(), lr=lr, momentum=momentum, weight_decay=weight_decay)
def objective_stage1(trial: optuna.Trial) -> float:
cfg = {
"stage": "stage1_opt",
"seed": int(trial.suggest_int("seed", 1, 10_000)),
"epochs": int(trial.suggest_int("epochs", 3, 5)),
"lr": float(trial.suggest_float("lr", 1e-5, 1e-2, log=True)),
"weight_decay": float(trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)),
"momentum": float(trial.suggest_float("momentum", 0.8, 0.99)),
"grad_clip_norm": float(trial.suggest_float("grad_clip_norm", 0.0, 5.0)),
}
set_global_seed(cfg["seed"])
run = wandb.init(
project="faster-rcnn-optuna-coco-minitrain",
name=f"optuna_stage1_trial_{trial.number:04d}",
config=cfg,
reinit=True
)
model = build_model().to(device)
optimizer = make_optimizer(model, cfg["lr"], cfg["momentum"], cfg["weight_decay"])
best_map = -1.0
for epoch in range(cfg["epochs"]):
losses = train_one_epoch(model, optimizer, train_loader, epoch, max_norm=cfg["grad_clip_norm"])
metrics = evaluate_coco_map(model, coco, val_loader)
val_map = metrics["mAP"]
best_map = max(best_map, val_map)
wandb.log({**losses, **{f"val_{k}": v for k, v in metrics.items()}, "epoch": epoch})
trial.report(val_map, step=epoch)
if trial.should_prune():
wandb.log({"pruned": 1, "best_val_mAP": best_map})
wandb.finish()
raise optuna.exceptions.TrialPruned()
wandb.log({"best_val_mAP": best_map, "pruned": 0})
wandb.finish()
return best_map
sampler = optuna.samplers.TPESampler(seed=BASE_SEED)
pruner = optuna.pruners.MedianPruner(n_startup_trials=5, n_warmup_steps=1)
study_stage1 = optuna.create_study(direction="maximize", sampler=sampler, pruner=pruner, study_name="stage1_opt")
[I 2026-03-05 18:09:24,527] A new study created in memory with name: stage1_opt
# Run Stage 1 study
N_TRIALS_STAGE1 = int(os.environ.get("HPO_TRIALS", 3)) # default 3 for demo, 30 for assignment
study_stage1.optimize(objective_stage1, n_trials=N_TRIALS_STAGE1, show_progress_bar=True)
print("Best Stage 1:", study_stage1.best_value)
print("Best params:", study_stage1.best_params)
Copy
0%| | 0/3 [00:00<?, ?it/s]
Copy
[34m[1mwandb[0m: [33mWARNING[0m Using a boolean value for 'reinit' is deprecated. Use 'return_previous' or 'finish_previous' instead.
Copy
[34m[1mwandb[0m: setting up run ftugc11l
Copy
[34m[1mwandb[0m: Tracking run with wandb version 0.25.0
Copy
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/workspaces/eng-ai-agents/wandb/run-20260305_180924-ftugc11l[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
Copy
[34m[1mwandb[0m: Syncing run [33moptuna_stage1_trial_0000[0m
Copy
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/pantelis/faster-rcnn-optuna-coco-minitrain[0m
Copy
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/pantelis/faster-rcnn-optuna-coco-minitrain/runs/ftugc11l[0m
Copy
Loading and preparing results...
DONE (t=0.06s)
creating index...
index created!
Running per image evaluation...
Evaluate annotation type *bbox*
Copy
DONE (t=16.28s).
Accumulating evaluation results...
Copy
DONE (t=2.59s).
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.103
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.151
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.116
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.071
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.114
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.130
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.121
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.085
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.129
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.153
8. Stage 2: RPN tuning (required)
Fix the best Stage 1 hyperparameters, then tune the RPN knobs that affect proposal quality and recall. Suggested search space:
- rpn_nms_thresh in [0.5, 0.9]
- rpn_pre_nms_topk in [1000, 4000]
- rpn_post_nms_topk in [300, 2000]
- rpn_fg_iou_thresh in [0.5, 0.8]
- rpn_bg_iou_thresh in [0.0, 0.4]
- rpn_batch_size_per_image in [128, 512]
- rpn_positive_fraction in [0.25, 0.75]
These map onto model.rpn.* fields (where supported).
def apply_rpn_hparams(model, cfg: Dict[str, Any]):
    # Apply what torchvision exposes on your version. In recent torchvision
    # releases, nms_thresh is a direct attribute on RegionProposalNetwork,
    # the top-k counts live in the private _pre_nms_top_n / _post_nms_top_n
    # dicts (pre_nms_top_n itself is a method), the IoU thresholds live on
    # proposal_matcher, and the sampling knobs live on fg_bg_sampler.
    # Attribute names can differ across versions, so verify on yours.
    rpn = model.rpn
    if "rpn_nms_thresh" in cfg: rpn.nms_thresh = float(cfg["rpn_nms_thresh"])
    if "rpn_pre_nms_topk" in cfg:
        rpn._pre_nms_top_n["training"] = int(cfg["rpn_pre_nms_topk"])
        rpn._pre_nms_top_n["testing"] = int(cfg["rpn_pre_nms_topk"])
    if "rpn_post_nms_topk" in cfg:
        rpn._post_nms_top_n["training"] = int(cfg["rpn_post_nms_topk"])
        rpn._post_nms_top_n["testing"] = int(cfg["rpn_post_nms_topk"])
    if "rpn_fg_iou_thresh" in cfg: rpn.proposal_matcher.high_threshold = float(cfg["rpn_fg_iou_thresh"])
    if "rpn_bg_iou_thresh" in cfg: rpn.proposal_matcher.low_threshold = float(cfg["rpn_bg_iou_thresh"])
    if "rpn_batch_size_per_image" in cfg: rpn.fg_bg_sampler.batch_size_per_image = int(cfg["rpn_batch_size_per_image"])
    if "rpn_positive_fraction" in cfg: rpn.fg_bg_sampler.positive_fraction = float(cfg["rpn_positive_fraction"])
def objective_stage2(trial: optuna.Trial) -> float:
    # Fix Stage 1 best optimizer params
    best1 = study_stage1.best_params
    cfg = {
        "stage": "stage2_rpn",
        "seed": int(trial.suggest_int("seed", 1, 10_000)),
        "epochs": 4,  # keep small for HPO budget
        "lr": float(best1["lr"]),
        "weight_decay": float(best1["weight_decay"]),
        "momentum": float(best1["momentum"]),
        "grad_clip_norm": float(best1.get("grad_clip_norm", 0.0)),
        # RPN search
        "rpn_nms_thresh": float(trial.suggest_float("rpn_nms_thresh", 0.5, 0.9)),
        "rpn_pre_nms_topk": int(trial.suggest_int("rpn_pre_nms_topk", 1000, 4000)),
        "rpn_post_nms_topk": int(trial.suggest_int("rpn_post_nms_topk", 300, 2000)),
        "rpn_fg_iou_thresh": float(trial.suggest_float("rpn_fg_iou_thresh", 0.5, 0.8)),
        "rpn_bg_iou_thresh": float(trial.suggest_float("rpn_bg_iou_thresh", 0.0, 0.4)),
        "rpn_batch_size_per_image": int(trial.suggest_int("rpn_batch_size_per_image", 128, 512)),
        "rpn_positive_fraction": float(trial.suggest_float("rpn_positive_fraction", 0.25, 0.75)),
    }
    set_global_seed(cfg["seed"])
    run = wandb.init(
        project="faster-rcnn-optuna-coco-minitrain",
        name=f"optuna_stage2_trial_{trial.number:04d}",
        config=cfg,
        reinit=True
    )
    model = build_model().to(device)
    apply_rpn_hparams(model, cfg)
    optimizer = make_optimizer(model, cfg["lr"], cfg["momentum"], cfg["weight_decay"])
    best_map = -1.0
    for epoch in range(cfg["epochs"]):
        losses = train_one_epoch(model, optimizer, train_loader, epoch, max_norm=cfg["grad_clip_norm"])
        metrics = evaluate_coco_map(model, coco, val_loader)
        val_map = metrics["mAP"]
        best_map = max(best_map, val_map)
        wandb.log({**losses, **{f"val_{k}": v for k, v in metrics.items()}, "epoch": epoch})
        trial.report(val_map, step=epoch)
        if trial.should_prune():
            wandb.log({"pruned": 1, "best_val_mAP": best_map})
            wandb.finish()
            raise optuna.exceptions.TrialPruned()
    wandb.log({"best_val_mAP": best_map, "pruned": 0})
    wandb.finish()
    return best_map
study_stage2 = optuna.create_study(direction="maximize", sampler=sampler, pruner=pruner, study_name="stage2_rpn")
# Run Stage 2 study
N_TRIALS_STAGE2 = int(os.environ.get("HPO_TRIALS", 3)) # default 3 for demo, 30 for assignment
study_stage2.optimize(objective_stage2, n_trials=N_TRIALS_STAGE2, show_progress_bar=True)
print("Best Stage 2:", study_stage2.best_value)
print("Best params:", study_stage2.best_params)
9. Stage 3: RoI head tuning (required)
Fix the Stage 1+2 best configuration and tune RoI head sampling and loss weighting. Suggested search space:
- roi_batch_size_per_image in [128, 512]
- roi_positive_fraction in [0.1, 0.5]
- cls_loss_weight in [0.5, 2.0]
- box_loss_weight in [0.5, 2.0]
Notes:
- Torchvision ROIHeads exposes sampler parameters.
- Loss weights might require applying weights to loss terms manually (by scaling loss_dict before summing). You will implement that by creating a custom train_one_epoch_weighted below.
def apply_roi_hparams(model, cfg: Dict[str, Any]):
    # In torchvision, the RoI sampling knobs live on roi_heads.fg_bg_sampler,
    # not directly on RoIHeads; setting roi.batch_size_per_image would only
    # add an unused attribute. Verify attribute names on your version.
    roi = model.roi_heads
    if "roi_batch_size_per_image" in cfg: roi.fg_bg_sampler.batch_size_per_image = int(cfg["roi_batch_size_per_image"])
    if "roi_positive_fraction" in cfg: roi.fg_bg_sampler.positive_fraction = float(cfg["roi_positive_fraction"])
def train_one_epoch_weighted(model, optimizer, data_loader: DataLoader, epoch: int, max_norm: float, cls_w: float, box_w: float):
    model.train()
    loss_sums = {"loss": 0.0}
    n = 0
    for images, targets in data_loader:
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        loss_dict = model(images, targets)
        # Scale RoI losses; keep RPN terms unscaled by default.
        if "loss_classifier" in loss_dict:
            loss_dict["loss_classifier"] = loss_dict["loss_classifier"] * cls_w
        if "loss_box_reg" in loss_dict:
            loss_dict["loss_box_reg"] = loss_dict["loss_box_reg"] * box_w
        losses = sum(loss for loss in loss_dict.values())
        optimizer.zero_grad(set_to_none=True)
        losses.backward()
        if max_norm and max_norm > 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
        optimizer.step()
        n += 1
        loss_sums["loss"] += float(losses.item())
        for k, v in loss_dict.items():
            loss_sums[k] = loss_sums.get(k, 0.0) + float(v.item())
    for k in loss_sums:
        loss_sums[k] /= max(1, n)
    return loss_sums
def objective_stage3(trial: optuna.Trial) -> float:
    best1 = study_stage1.best_params
    best2 = study_stage2.best_params
    cfg = {
        "stage": "stage3_roi",
        "seed": int(trial.suggest_int("seed", 1, 10_000)),
        "epochs": 4,
        "lr": float(best1["lr"]),
        "weight_decay": float(best1["weight_decay"]),
        "momentum": float(best1["momentum"]),
        "grad_clip_norm": float(best1.get("grad_clip_norm", 0.0)),
        # RPN fixed (best2)
        **{k: best2[k] for k in best2 if k.startswith("rpn_")},
        # RoI search
        "roi_batch_size_per_image": int(trial.suggest_int("roi_batch_size_per_image", 128, 512)),
        "roi_positive_fraction": float(trial.suggest_float("roi_positive_fraction", 0.1, 0.5)),
        "cls_loss_weight": float(trial.suggest_float("cls_loss_weight", 0.5, 2.0)),
        "box_loss_weight": float(trial.suggest_float("box_loss_weight", 0.5, 2.0)),
    }
    set_global_seed(cfg["seed"])
    run = wandb.init(
        project="faster-rcnn-optuna-coco-minitrain",
        name=f"optuna_stage3_trial_{trial.number:04d}",
        config=cfg,
        reinit=True
    )
    model = build_model().to(device)
    apply_rpn_hparams(model, cfg)
    apply_roi_hparams(model, cfg)
    optimizer = make_optimizer(model, cfg["lr"], cfg["momentum"], cfg["weight_decay"])
    best_map = -1.0
    for epoch in range(cfg["epochs"]):
        losses = train_one_epoch_weighted(
            model, optimizer, train_loader, epoch,
            max_norm=cfg["grad_clip_norm"],
            cls_w=cfg["cls_loss_weight"],
            box_w=cfg["box_loss_weight"],
        )
        metrics = evaluate_coco_map(model, coco, val_loader)
        val_map = metrics["mAP"]
        best_map = max(best_map, val_map)
        wandb.log({**losses, **{f"val_{k}": v for k, v in metrics.items()}, "epoch": epoch})
        trial.report(val_map, step=epoch)
        if trial.should_prune():
            wandb.log({"pruned": 1, "best_val_mAP": best_map})
            wandb.finish()
            raise optuna.exceptions.TrialPruned()
    wandb.log({"best_val_mAP": best_map, "pruned": 0})
    wandb.finish()
    return best_map
study_stage3 = optuna.create_study(direction="maximize", sampler=sampler, pruner=pruner, study_name="stage3_roi")
# Run Stage 3 study
N_TRIALS_STAGE3 = int(os.environ.get("HPO_TRIALS", 3)) # default 3 for demo, 30 for assignment
study_stage3.optimize(objective_stage3, n_trials=N_TRIALS_STAGE3, show_progress_bar=True)
print("Best Stage 3:", study_stage3.best_value)
print("Best params:", study_stage3.best_params)
10. Stage 4: post-processing calibration (required)
You will tune the score threshold and NMS IoU threshold without retraining. Suggested ranges:
- score_thresh in [0.01, 0.5]
- box_nms_thresh in [0.3, 0.7]
These correspond to:
- model.roi_heads.score_thresh
- model.roi_heads.nms_thresh
- model.roi_heads.detections_per_img
Procedure:
- Train one final model using the best Stage 1+2+3 configuration (longer epochs, e.g., 10–15).
- Run an Optuna study that only changes post-processing parameters and evaluates on val.
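To build intuition for box_nms_thresh before launching the study, here is a minimal pure-NumPy sketch of greedy NMS (illustrative only; the model itself uses torchvision's batched NMS). Two near-duplicate detections of the same object survive a loose IoU threshold but are suppressed under a strict one:

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, iou_thresh):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop
    # remaining boxes whose IoU with it is >= iou_thresh.
    order = np.argsort(scores)[::-1]
    keep = []
    while len(order) > 0:
        i = order[0]
        keep.append(int(i))
        order = np.array([j for j in order[1:] if iou(boxes[i], boxes[j]) < iou_thresh])
    return keep

# Two near-duplicate detections of one object (IoU ~0.68) plus a distinct object.
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(len(nms(boxes, scores, iou_thresh=0.5)))  # strict threshold: duplicate suppressed
print(len(nms(boxes, scores, iou_thresh=0.9)))  # loose threshold: duplicate survives
```

Lowering box_nms_thresh trades duplicate boxes for potential misses of genuinely adjacent objects, which is exactly the tension Stage 4 calibrates.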
def apply_postprocess_hparams(model, cfg: Dict[str, Any]):
    roi = model.roi_heads
    if "score_thresh" in cfg: roi.score_thresh = float(cfg["score_thresh"])
    if "box_nms_thresh" in cfg: roi.nms_thresh = float(cfg["box_nms_thresh"])
    if "detections_per_img" in cfg: roi.detections_per_img = int(cfg["detections_per_img"])
def train_final_model(best_cfg: Dict[str, Any], epochs: int = 12, seed: int = 2026) -> str:
    set_global_seed(seed)
    run = wandb.init(
        project="faster-rcnn-optuna-coco-minitrain",
        name=f"final_train_seed_{seed}",
        config={**best_cfg, "final_epochs": epochs, "final_seed": seed},
        reinit=True
    )
    model = build_model().to(device)
    apply_rpn_hparams(model, best_cfg)
    apply_roi_hparams(model, best_cfg)
    optimizer = make_optimizer(model, best_cfg["lr"], best_cfg["momentum"], best_cfg["weight_decay"])
    for epoch in range(epochs):
        losses = train_one_epoch_weighted(
            model, optimizer, train_loader, epoch,
            max_norm=best_cfg.get("grad_clip_norm", 0.0),
            cls_w=best_cfg.get("cls_loss_weight", 1.0),
            box_w=best_cfg.get("box_loss_weight", 1.0),
        )
        metrics = evaluate_coco_map(model, coco, val_loader)
        wandb.log({**losses, **{f"val_{k}": v for k, v in metrics.items()}, "epoch": epoch})
    os.makedirs("checkpoints", exist_ok=True)  # ensure the checkpoint dir exists
    ckpt = os.path.join("checkpoints", f"final_fasterrcnn_seed_{seed}.pt")
    torch.save(model.state_dict(), ckpt)
    wandb.save(ckpt)
    wandb.finish()
    return ckpt
# Compose best config from Stage 1-3
best_cfg = {}
best_cfg.update(study_stage1.best_params)
best_cfg.update({k: v for k, v in study_stage2.best_params.items() if k.startswith("rpn_")})
best_cfg.update({k: v for k, v in study_stage3.best_params.items() if k.startswith("roi_") or k.endswith("_weight")})
# Ensure required optimizer keys exist
# (names differ across studies; normalize to expected keys)
# Stage1 keys are: lr, weight_decay, momentum, grad_clip_norm
# Keep them as is.
print("Best combined cfg:", best_cfg)
FINAL_CKPT = train_final_model(best_cfg, epochs=12, seed=2026)
print("Final ckpt:", FINAL_CKPT)
@torch.no_grad()
def evaluate_with_postprocess(model, score_thresh: float, nms_thresh: float, dets_per_img: int = 100):
    apply_postprocess_hparams(model, {"score_thresh": score_thresh, "box_nms_thresh": nms_thresh, "detections_per_img": dets_per_img})
    return evaluate_coco_map(model, coco, val_loader)

def objective_stage4(trial: optuna.Trial) -> float:
    cfg = {
        "stage": "stage4_post",
        "score_thresh": float(trial.suggest_float("score_thresh", 0.01, 0.5, log=True)),
        "box_nms_thresh": float(trial.suggest_float("box_nms_thresh", 0.3, 0.7)),
        "detections_per_img": int(trial.suggest_int("detections_per_img", 50, 300)),
    }
    run = wandb.init(
        project="faster-rcnn-optuna-coco-minitrain",
        name=f"optuna_stage4_trial_{trial.number:04d}",
        config=cfg,
        reinit=True
    )
    model = build_model().to(device)
    model.load_state_dict(torch.load(FINAL_CKPT, map_location=device))
    apply_rpn_hparams(model, best_cfg)
    apply_roi_hparams(model, best_cfg)
    metrics = evaluate_with_postprocess(model, cfg["score_thresh"], cfg["box_nms_thresh"], cfg["detections_per_img"])
    wandb.log({f"val_{k}": v for k, v in metrics.items()})
    wandb.finish()
    return metrics["mAP"]
study_stage4 = optuna.create_study(direction="maximize", sampler=sampler, pruner=None, study_name="stage4_post")
N_TRIALS_STAGE4 = int(os.environ.get("HPO_TRIALS", 3)) # default 3 for demo, 30 for assignment
study_stage4.optimize(objective_stage4, n_trials=N_TRIALS_STAGE4, show_progress_bar=True)
print("Best Stage 4:", study_stage4.best_value)
print("Best post params:", study_stage4.best_params)
11. Final multi-seed retraining (required)
Retrain the best configuration (Stages 1–4) with 3 different seeds and report mean mAP ± std. You must log all runs to W&B and include the W&B links in your report.
best_post = study_stage4.best_params if 'study_stage4' in globals() and study_stage4.best_params else {"score_thresh": 0.05, "box_nms_thresh": 0.5, "detections_per_img": 100}
best_full = {**best_cfg, **best_post}
print("Best full config:", best_full)
SEEDS = [11, 22, 33]
ckpts = []
for s in SEEDS:
    ckpts.append(train_final_model(best_full, epochs=12, seed=s))
print("ckpts:", ckpts)
# Evaluate each checkpoint with best post-processing
maps = []
for ckpt in ckpts:
    model = build_model().to(device)
    model.load_state_dict(torch.load(ckpt, map_location=device))
    apply_rpn_hparams(model, best_full)
    apply_roi_hparams(model, best_full)
    apply_postprocess_hparams(model, best_full)
    metrics = evaluate_coco_map(model, coco, val_loader)
    maps.append(metrics["mAP"])
    print(ckpt, metrics)
maps = np.array(maps, dtype=np.float32)
print("mAP mean ± std:", float(maps.mean()), float(maps.std(ddof=1)))
12. Small-object transfer test: drones (extra credit)
You must evaluate:
- the baseline COCO MiniTrain fine-tuned model (Section 6)
- the tuned model (best configuration from Sections 7–11)
Requirements
- Do not retune hyperparameters on drones initially.
- Compute at least:
  - mAP, AP50, recall (or COCO AR)
- Provide qualitative results showing:
  - missed small drones
  - duplicates / NMS issues
  - low-confidence detections
Implementation note
You must make the drone dataset available in COCO format (images + instances JSON). Set the paths below accordingly.
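For reference, a minimal COCO-format instances file looks like the sketch below (the file name, image size, and category id are placeholders; adapt them to your drone data). pycocotools expects the images, annotations, and categories keys, bbox in [x, y, width, height] pixels, plus area and iscrowd on every annotation:

```python
import json

# Minimal COCO "instances" skeleton for a one-class drone dataset
# (illustrative values only; ids and file names must match your data).
coco_dict = {
    "images": [
        {"id": 1, "file_name": "frame_0001.jpg", "width": 1920, "height": 1080},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels; area and iscrowd are
        # required by pycocotools for evaluation.
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100.0, 200.0, 24.0, 18.0], "area": 24.0 * 18.0, "iscrowd": 0},
    ],
    "categories": [{"id": 1, "name": "drone"}],
}

with open("instances_drone_example.json", "w") as f:
    json.dump(coco_dict, f)
```

Loading the file with `COCO("instances_drone_example.json")` is a quick sanity check that the schema is valid before wiring it into the DataLoader.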
# TODO: Set these paths to your drone dataset (COCO format) from Assignment 3
DRONE_ROOT = "/content/drone_dataset" # TODO
DRONE_IMAGES_DIR = os.path.join(DRONE_ROOT, "images") # TODO
DRONE_ANN_JSON = os.path.join(DRONE_ROOT, "annotations", "instances_drone.json") # TODO
# Uncomment after you place the dataset:
# assert os.path.exists(DRONE_IMAGES_DIR), "Set DRONE_IMAGES_DIR"
# assert os.path.exists(DRONE_ANN_JSON), "Set DRONE_ANN_JSON"
# drone_coco = COCO(DRONE_ANN_JSON)
# drone_img_ids = sorted(drone_coco.getImgIds())
# drone_ds = CocoMiniTrainDataset(drone_coco, DRONE_IMAGES_DIR, drone_img_ids, train=False)
# drone_loader = DataLoader(drone_ds, batch_size=1, shuffle=False, num_workers=2, collate_fn=collate_fn)
# def eval_on_drones(ckpt_path: str, tag: str):
#     run = wandb.init(project="faster-rcnn-optuna-coco-minitrain", name=f"drone_eval_{tag}", reinit=True)
#     model = build_model().to(device)
#     model.load_state_dict(torch.load(ckpt_path, map_location=device))
#     apply_rpn_hparams(model, best_full)
#     apply_roi_hparams(model, best_full)
#     apply_postprocess_hparams(model, best_full)
#     metrics = evaluate_coco_map(model, drone_coco, drone_loader)
#     wandb.log({f"drone_{k}": v for k, v in metrics.items()})
#     wandb.finish()
#     return metrics
# Example usage (after you set paths):
# baseline_metrics = eval_on_drones(BASELINE_CKPT, "baseline")
# tuned_metrics = eval_on_drones(ckpts[0], "tuned_seed11")
# print("baseline:", baseline_metrics)
# print("tuned:", tuned_metrics)
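For the qualitative failure analysis, a small drawing helper makes missed drones and low-confidence detections visible. draw_detections below is a hypothetical Pillow-based sketch (not defined elsewhere in this notebook); it colors boxes by confidence so low-score detections stand out in your report figures:

```python
from PIL import Image, ImageDraw

def draw_detections(img, boxes, scores, score_thresh=0.3):
    # Draw predicted boxes on a copy of a PIL image: green for confident
    # detections, red for low-confidence ones, with the score printed above.
    out = img.copy()
    draw = ImageDraw.Draw(out)
    for (x1, y1, x2, y2), s in zip(boxes, scores):
        color = "lime" if s >= score_thresh else "red"
        draw.rectangle([x1, y1, x2, y2], outline=color, width=2)
        draw.text((x1, max(0, y1 - 10)), f"{s:.2f}", fill=color)
    return out

# Synthetic example: one confident box and one low-confidence box.
canvas = Image.new("RGB", (200, 200), "black")
vis = draw_detections(canvas, [(20, 20, 60, 60), (120, 120, 160, 160)], [0.9, 0.1])
```

In practice you would pass a real frame plus the model's `boxes`/`scores` outputs, then log the result with `wandb.Image(vis)` so the failure cases appear alongside the metrics.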
13. Required written answers (include in your report)
Answer these questions using evidence (W&B plots, metrics, qualitative results):
- Which Stage (1–4) delivered the largest gain in mAP? Why?
- Which hyperparameters most influenced small-object recall on drones?
- Did increasing rpn_pre_nms_topk help drone detection? Explain using proposal reasoning.
- Did changing NMS thresholds change the duplicate-box failure mode? Provide examples.
- Is the tuned configuration robust across seeds? Use mean ± std.

