
Introduction

We selected DeepLabV3+ as the core model architecture for our project deliverable because of its proven ability to deliver high-quality semantic segmentation across a wide range of applications and domains. The model benefits from mature implementations in frameworks like Detectron2, and it has become a go-to choice in professional and academic settings for analyzing high-resolution satellite imagery; its adoption is documented across hundreds of academic papers on remote sensing tasks.
  • Efficient design. By using Atrous convolutions throughout its encoder–decoder structure, DeepLab models achieve a strong balance of accuracy and computational efficiency (Chen et al., 2016).
  • Robust multi-scale context. The model’s Atrous Spatial Pyramid Pooling (ASPP) captures information at multiple scales, enabling it to handle the diverse scales and shapes of sidewalks visible in satellite imagery (Chen et al., 2017).
  • Precise boundary handling. DeepLabV3+ incorporates a decoder module that sharpens segmentation along object edges, making it effective at capturing smaller, non-linear structures like sidewalks (Chen et al., 2018).
  • Proven in remote-sensing practice. DeepLabV3+ has been applied successfully to remote-sensing tasks that require complex feature extraction, and it frequently serves as a core or comparison model in real-world, high-resolution satellite imagery studies (many of which surface with a quick search on the MDPI website).
  • Detectron2 support. The model is available out-of-the-box in Detectron2, making it straightforward to train and deploy without requiring costly custom development.
DeepLabV3+ performance is influenced heavily by both the choice of backbone network and the tuning of hyperparameters. In the following sections, we explain our decisions in these areas to show how we tailored the model to maximize results.
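
To make the atrous-convolution and ASPP ideas above concrete, here is a minimal PyTorch sketch of an ASPP-style block; the channel sizes and dilation rates are illustrative only and do not reproduce the exact DeepLabV3+ modules.

import torch
import torch.nn as nn

# Minimal ASPP-style block: parallel 3x3 convolutions with increasing dilation
# rates observe progressively larger context without reducing spatial resolution.
class TinyASPP(nn.Module):
    def __init__(self, in_ch=256, out_ch=64, dilations=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.project = nn.Conv2d(out_ch * len(dilations), out_ch, kernel_size=1)

    def forward(self, x):
        # Each branch keeps the input's spatial size; concatenation mixes multi-scale context.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

features = torch.randn(1, 256, 64, 64)
print(TinyASPP()(features).shape)  # torch.Size([1, 64, 64, 64])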

Selected DeepLabV3+ Backbone

We selected DeepLabV3+ with the R103-DC5 backbone from Detectron2’s semantic segmentation model zoo for our DeepLabV3+ model. This configuration has demonstrated strong performance on Cityscapes, reaching an mIoU of 80.0 at 1024×2048 resolution.

Cityscapes Semantic Segmentation Metrics

Cityscapes models are trained with ImageNet pretraining.
| Method | Backbone | Output resolution | mIoU | model id |
| --- | --- | --- | --- | --- |
| DeepLabV3 | R101-DC5 | 1024×2048 | 76.7 | - |
| DeepLabV3 | R103-DC5 | 1024×2048 | 78.5 | 28041665 |
| DeepLabV3+ | R101-DC5 | 1024×2048 | 78.1 | - |
| DeepLabV3+ | R103-DC5 | 1024×2048 | 80.0 | 28054032 |

Why R103-DC5 with DeepLabV3+?

This configuration pairs the R103 backbone with DC5 dilation and the DeepLabV3+ decoder, balancing spatial detail and contextual awareness in semantic segmentation:
  • R103 Backbone — A ResNet-101 variant where the initial 7×7 convolution is replaced by three sequential 3×3 convolutions (“DeepLab stem”), improving preservation of fine-grained spatial information in early layers. Pretrained on ImageNet.
  • DC5 — Introduces dilated convolution in the res5 stage, maintaining higher spatial resolution in deep feature maps for better delineation of thin, elongated structures such as sidewalks.
  • DeepLabV3+ Decoder — Combines an Atrous Spatial Pyramid Pooling (ASPP) module for multi-scale context capture with a decoder path that fuses high-level features with low-level detail for boundary refinement.
Performance Rationale
  1. High-resolution feature extraction from the R103 stem boosts detection of narrow sidewalks in high-resolution imagery.
  2. Dilated convolution (DC5) preserves geometric detail in deeper layers, essential for thin-structure retention.
  3. Proven benchmark performance — R103-DC5 has shown superior results on Cityscapes, outperforming R101-DC5 in DeepLabV3+ by +1.9 mIoU (78.1 → 80.0).
  4. Compatibility with Detectron2 pretrained weights, supporting faster convergence and stronger generalization to satellite imagery.
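
For intuition, the R103 “DeepLab stem” described above can be sketched as three stacked 3×3 convolutions; this is a simplified stand-in (plain BatchNorm, no trailing max-pool), not the exact Detectron2 implementation, which uses SyncBN.

import torch
import torch.nn as nn

# Simplified "DeepLab stem": three 3x3 convolutions (3 -> 64 -> 64 -> 128) replace the
# single 7x7 convolution of a standard ResNet stem, preserving more fine spatial detail.
deeplab_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128), nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 256, 256)
print(deeplab_stem(x).shape)  # torch.Size([1, 128, 128, 128]) - only a single 2x downsample so far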

Config Files and Hyperparameter Choice

In Detectron2, a training job is initialized from a config .yaml file that specifies how the model and training schedule are constructed for that job. The following is the .yaml file we used to construct and train our DeepLabV3+ model for our first successful New Jersey state-wide inference job.

Our Model’s Configuration YAML File

# configs/deeplab-v3-plus-resnet103.yaml
VERSION: 2
SEED: 42
MODEL:
  META_ARCHITECTURE: "SemanticSegmentor"
  DEVICE: "cuda"
  WEIGHTS: "detectron2://DeepLab/R-103.pkl"  # ImageNet-pretrained R-103 weights from the Detectron2 model zoo

  PIXEL_MEAN: [123.675, 116.280, 103.530]  # ImageNet RGB channel means used for input normalization
  PIXEL_STD: [58.395, 57.120, 57.375]      # ImageNet RGB channel standard deviations

  BACKBONE:
    FREEZE_AT: 2  # freeze the stem and res2 stage
    NAME: "build_resnet_deeplab_backbone"

  RESNETS:
    DEPTH: 101
    NORM: "SyncBN"
    OUT_FEATURES: ["res2", "res5"]
    RES4_DILATION: 1
    RES5_DILATION: 2
    RES5_MULTI_GRID: [1, 2, 4]
    STEM_TYPE: "deeplab"
    STEM_OUT_CHANNELS: 128
    STRIDE_IN_1X1: false

  SEM_SEG_HEAD:
    NAME: "WeightedDeepLabHead"
    IN_FEATURES: ["res2", "res5"]
    PROJECT_FEATURES: ["res2"]
    PROJECT_CHANNELS: [48]
    NORM: "SyncBN"
    COMMON_STRIDE: 4
    ASPP_CHANNELS: 256
    ASPP_DILATIONS: [6, 12, 18]
    ASPP_DROPOUT: 0.1
    CONVS_DIM: 256
    USE_DEPTHWISE_SEPARABLE_CONV: false
    NUM_CLASSES: 2  # background and sidewalk
    IGNORE_VALUE: 255  # label value excluded from the loss

    LOSS_TYPE: "hard_pixel_mining"
    TOP_K_PERCENT_PIXELS: 1.0  # 1.0 keeps all pixels (no hard-pixel filtering)
    CLASS_WEIGHT_BACKGROUND: 1.0
    CLASS_WEIGHT_FOREGROUND: 10.0  # up-weight the sparse sidewalk class 10x

INPUT:
  FORMAT: "RGB"
  MASK_FORMAT: "bitmask"
  RANDOM_FLIP: "horizontal"
  MIN_SIZE_TRAIN: [256]  # resize so the shorter image edge is 256 px
  MAX_SIZE_TRAIN: 256    # cap the longer edge at 256 px
  MIN_SIZE_TRAIN_SAMPLING: "choice"
  CROP:
    ENABLED: false

SOLVER:
  IMS_PER_BATCH: 48
  BASE_LR: 0.001
  MAX_ITER: 100000
  LR_SCHEDULER_NAME: "WarmupCosineLR"
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: "linear"
  CHECKPOINT_PERIOD: 200
  MOMENTUM: 0.9
  NESTEROV: False
  BASE_LR_END: 0.0001

  CLIP_GRADIENTS:
    ENABLED: True
    CLIP_TYPE: "norm"
    CLIP_VALUE: 1.0
    NORM_TYPE: 2.0

DATALOADER:
  NUM_WORKERS: 1
  ASPECT_RATIO_GROUPING: false

DATASETS:
  TRAIN: ["stream_dummy"]  # placeholder entry; data is streamed by our custom Parquet/Ray ingestion
  TEST: []

TEST:
  EVAL_PERIOD: 2000
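
For reference, a training job can consume this file roughly as follows. This is a sketch rather than our exact launcher: it assumes the DeepLab project config is available via detectron2.projects.deeplab and that our custom head and extra keys (e.g., CLASS_WEIGHT_*) have already been registered before the merge.

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2.projects.deeplab import add_deeplab_config

# Sketch only: custom components (WeightedDeepLabHead, CLASS_WEIGHT_* keys, the
# streaming dataset) must be registered by the project's code before this runs.
cfg = get_cfg()
add_deeplab_config(cfg)  # adds DeepLab-specific keys such as LOSS_TYPE and RES5_MULTI_GRID
cfg.merge_from_file("configs/deeplab-v3-plus-resnet103.yaml")

trainer = DefaultTrainer(cfg)  # builds the SemanticSegmentor and solver from the config
trainer.resume_or_load(resume=False)
trainer.train()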

Explanation of Config Parameters

The chosen values were derived from iterative testing and manual adjustments during model development, rather than a formal hyperparameter optimization process. A future step for this pipeline will be systematic hyperparameter tuning in conjunction with training on a larger, higher-quality dataset. We made significant modifications to the baseline Detectron2 implementation. The most notable change was developing a custom DeepLabV3+ architecture integration tailored for our data ingestion requirements. Our modified implementation supports loading datasets directly from Parquet files in a Hugging Face / Ray-Data compatible in-memory format, enabling faster, more efficient training while minimizing I/O bottlenecks.
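
As an illustration of what Parquet-backed ingestion can look like, the sketch below registers an in-memory dataset with Detectron2's DatasetCatalog; the file name, column names, and record layout are hypothetical and do not reflect our exact ingestion code.

import pandas as pd
from detectron2.data import DatasetCatalog, MetadataCatalog

# Hypothetical schema: a "tiles.parquet" file with image/mask paths and tile sizes.
def load_parquet_dataset(path="tiles.parquet"):
    df = pd.read_parquet(path)  # one read into memory; no per-sample file listing
    return [
        {
            "file_name": row.image_path,
            "sem_seg_file_name": row.mask_path,
            "height": int(row.height),
            "width": int(row.width),
        }
        for row in df.itertuples()
    ]

DatasetCatalog.register("sidewalk_tiles_train", load_parquet_dataset)
MetadataCatalog.get("sidewalk_tiles_train").set(
    stuff_classes=["background", "sidewalk"], ignore_label=255
)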

Model Architecture Review

This section reviews our deployed DeepLabV3+ (ResNet-101/103 style) semantic segmentor for sidewalk detection.

Overview Table

| Component | Details |
| --- | --- |
| Input | name: image, tensor: float32[batch_size, 3, height, width] |
| Output | name: sem_seg, tensor: float32[batch_size, 2, height, width] |
| Parameters | 59,408,266 |
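
The parameter count above can be reproduced with a quick check on the instantiated model (a sketch, assuming the built SemanticSegmentor is available as model):

# Count total and trainable parameters of the instantiated Detectron2 model.
total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total={total:,}  trainable={trainable:,}")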

Backbone (ResNet with DeepLab Stem)

| Stage | Block / Layer | Channels (in→out) | Kernel / Dilation | Stride | Norm | Freeze |
| --- | --- | --- | --- | --- | --- | --- |
| Stem (frozen) | conv1 → conv2 → conv3 | 3→64→64→128 | 3×3 / 1 | 2, 1, 1 | SyncBN | Yes |
| res2 (frozen) | Bottleneck ×3 | 128→256 | (1×1, 3×3, 1×1) / 1 | 1 | SyncBN | Yes |
| res3 | Bottleneck ×4 | 256→512 | (1×1, 3×3, 1×1) / 1 | 2 (first) | SyncBN | No |
| res4 | Bottleneck ×23 | 512→1024 | (1×1, 3×3, 1×1) / 1 | 2 (first) | SyncBN | No |
| res5 | Bottleneck ×3 | 1024→2048 | (1×1, 3×3, 1×1) / 2, 4, 8 | 1 | SyncBN | No |

Decoder & Segmentation Head (DeepLabV3+)

| Component | Block / Layer | Channels (in→out) | Kernel / Dilation | Stride | Norm |
| --- | --- | --- | --- | --- | --- |
| ASPP (res5) | 5 parallel branches | 2048→(256×4) & 2048→256 | 1×1; 3×3 / 6, 12, 18; GAP→1×1 | 1 | SyncBN |
| Low-level proj (res2) | 1×1 projection | 256→48 | 1×1 / 1 | 1 | SyncBN |
| Fusion convs | 3×3 convs ×2 | 304→256→256 | 3×3 / 1 | 1 | SyncBN |
| Predictor | 1×1 classifier | 256→2 | 1×1 / 1 | 1 | — |
| Loss | Cross-Entropy (weighted) | — | — | — | — |

Key Takeaways

  • Efficient by design: We keep output stride ≈16 using atrous convs in res5, so the encoder preserves detail without extra downsampling.
  • Robust multi-scale context (ASPP): ASPP aggregates context via 1×1 and 3×3 (dilations 6,12,18) plus image pooling, supporting varying sidewalk widths and scene scales.
  • Precise boundary handling (decoder): Low-level skip from res2 (256→48) preserves edges; fusing with ASPP sharpens thin, curvilinear structures.
  • Clear Model Architecture: Enables clean training, ONNX export, and Triton serving without exotic ops or custom layers.

Export to ONNX Format

ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models that provides portability across frameworks and inference deployment backends. Exporting to ONNX format allows us to take a model trained in PyTorch with Detectron2 and deploy it efficiently across different hardware and serving platforms.

Export Process Overview

We exported our SemanticSegmentor (DeepLabV3+ with WeightedDeepLabHead) to ONNX in three main steps:
  1. Load Model and Weights - Resolve the model configuration, instantiate the model, and load the final trained checkpoint.
  2. Wrap for Batched Inference - Define a custom BatchedWrapper class with a tensor-only forward signature (sketched below).
  3. Export with torch.onnx.export - Use a dummy input for tracing with dynamic axes for batch size, height, and width.
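The wrapper referenced in step 2 might look roughly like the following. This is an illustrative sketch (the actual class is project-specific), assuming model is the SemanticSegmentor loaded in step 1 and that WeightedDeepLabHead follows Detectron2's (predictions, losses) return convention in eval mode.

import torch
import torch.nn as nn

# Illustrative sketch: bypass Detectron2's dict-based preprocessing and call the
# backbone and segmentation head directly on a normalized, batched image tensor.
class BatchedWrapper(nn.Module):
    def __init__(self, model):
        super().__init__()
        self.backbone = model.backbone
        self.sem_seg_head = model.sem_seg_head

    def forward(self, image):
        features = self.backbone(image)
        sem_seg, _ = self.sem_seg_head(features)  # (predictions, losses); losses are empty in eval mode
        return sem_seg

wrapped_model = BatchedWrapper(model).eval()
dummy_tensor = torch.randn(1, 3, 256, 256)

With the wrapper and dummy tensor in place, the export call from step 3 is: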
torch.onnx.export(
    wrapped_model,
    dummy_tensor,
    "batched_semseg_model.onnx",
    opset_version=16,
    input_names=["image"],
    output_names=["sem_seg"],
    dynamic_axes={
        "image": {0: "batch_size", 2: "height", 3: "width"},
        "sem_seg": {0: "batch_size", 2: "height", 3: "width"},
    }
)
The ONNX-exported model does not apply input normalization internally. For deployment, apply normalization (subtract PIXEL_MEAN, divide by PIXEL_STD) as a separate preprocessing step before passing inputs to the ONNX model.
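
For example, a minimal ONNX Runtime sketch of that preprocessing (assuming image is an H×W×3 RGB uint8 array) could look like this:

import numpy as np
import onnxruntime as ort

# Normalization constants mirror PIXEL_MEAN / PIXEL_STD from the training config.
PIXEL_MEAN = np.array([123.675, 116.280, 103.530], dtype=np.float32)
PIXEL_STD = np.array([58.395, 57.120, 57.375], dtype=np.float32)

def preprocess(rgb_uint8):
    # HxWx3 RGB -> 1x3xHxW normalized float32, matching the exported "image" input.
    x = (rgb_uint8.astype(np.float32) - PIXEL_MEAN) / PIXEL_STD
    return np.transpose(x, (2, 0, 1))[np.newaxis, ...]

session = ort.InferenceSession("batched_semseg_model.onnx")
logits = session.run(["sem_seg"], {"image": preprocess(image)})[0]  # float32[1, 2, H, W]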

Deployment Readiness

The exported ONNX model is now ready for use with:
  • ONNX Runtime (CPU/GPU inference)
  • Triton Inference Server (scalable deployment)
  • TensorRT (optimized GPU inference)

Chapter Summary

The design of DeepLabV3+ allows it to excel in detecting sidewalks within high-resolution urban imagery by combining:
  • Rich semantic features from a deep backbone to differentiate sidewalks from visually similar surfaces
  • High-resolution feature preservation to maintain the narrow, elongated geometry typical of sidewalks
  • Multi-scale context capture to identify sidewalks across diverse widths, textures, and environmental settings
  • Refined boundary prediction through low-level feature fusion
This balance of global contextual consideration and fine, local feature preservation makes the model well-suited for accurate, scalable sidewalk segmentation across the state of New Jersey.