Introduction
We selected DeepLabV3+ as the core machine learning architecture for our project deliverable because of its proven ability to deliver high-quality semantic segmentation across a wide range of applications and domains. The model has mature implementations in frameworks such as Detectron2, it has become a go-to choice in professional and academic settings for analyzing high-resolution satellite imagery, and its adoption is well documented in the remote-sensing literature. In particular:
- Efficient design. By using atrous convolutions throughout its encoder–decoder structure, DeepLab models achieve a strong balance of accuracy and computational efficiency (Chen et al., 2016).
- Robust multi-scale context. The model’s Atrous Spatial Pyramid Pooling (ASPP) captures information at multiple scales, enabling it to handle the diverse scales and shapes of sidewalks visible in satellite imagery (Chen et al., 2017).
- Precise boundary handling. DeepLabV3+ incorporates a decoder module that sharpens segmentation along object edges, making it effective at capturing smaller, non-linear structures like sidewalks (Chen et al., 2018).
- Proven in remote-sensing practice. DeepLabV3+ has been applied successfully to remote-sensing tasks that require complex feature extraction, demonstrating its reliability as a core or comparison model for real-world, high-resolution satellite imagery; many such studies can be found with a quick search of the MDPI remote-sensing journals.
- Detectron2 support. The model is available out of the box in Detectron2, making it straightforward to train and deploy without costly custom development.
DeepLabV3+ performance depends heavily on both the choice of backbone network and the tuning of hyperparameters. In the following sections, we explain our decisions in both areas and how they tailor the model to our sidewalk-segmentation task.
Selected DeepLabV3+ Backbone
We selected the DeepLabV3+ R103-DC5 configuration from Detectron2’s semantic segmentation model zoo as the backbone for our model. It has demonstrated strong performance on Cityscapes, reaching 80.0 mIoU at 1024×2048 resolution.
Cityscapes Semantic Segmentation Metrics
Cityscapes models are trained with ImageNet pretraining.
| Method | Backbone | Output resolution | mIoU | Model ID |
|---|---|---|---|---|
| DeepLabV3 | R101-DC5 | 1024×2048 | 76.7 | - |
| DeepLabV3 | R103-DC5 | 1024×2048 | 78.5 | 28041665 |
| DeepLabV3+ | R101-DC5 | 1024×2048 | 78.1 | - |
| DeepLabV3+ | R103-DC5 | 1024×2048 | 80.0 | 28054032 |
Why R103-DC5 with DeepLabV3+?
This configuration pairs a ResNet-103 backbone with DC5 dilation and the DeepLabV3+ decoder, balancing spatial detail and contextual awareness for semantic segmentation (a short sketch of the modified stem follows this list):
- R103 Backbone — A ResNet-101 variant where the initial 7×7 convolution is replaced by three sequential 3×3 convolutions (“DeepLab stem”), improving preservation of fine-grained spatial information in early layers. Pretrained on ImageNet.
- DC5 — Introduces dilated convolution in the res5 stage, maintaining higher spatial resolution in deep feature maps for better delineation of thin, elongated structures such as sidewalks.
- DeepLabV3+ Decoder — Combines an Atrous Spatial Pyramid Pooling (ASPP) module for multi-scale context capture with a decoder path that fuses high-level features with low-level detail for boundary refinement.
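To make the stem modification concrete, the sketch below shows a DeepLab-style stem in plain PyTorch. It is illustrative only, not the Detectron2 implementation (which uses SyncBN and plugs into the ResNet builder), but it follows the 3→64→64→128 channel progression and 2-1-1 stride pattern listed in the backbone table later in this chapter.

```python
import torch
from torch import nn

class DeepLabStemSketch(nn.Module):
    """Three 3x3 convolutions in place of ResNet's single 7x7 stem convolution."""
    def __init__(self, in_channels: int = 3, out_channels: int = 128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),   # SyncBN in the actual config
            nn.Conv2d(64, 64, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.Conv2d(64, out_channels, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )
        # Standard ResNet max pooling follows the stem, bringing the output stride to 4.
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.convs(x))
```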
Performance Rationale
- Fine-grained feature extraction from the R103 stem improves detection of narrow sidewalks in high-resolution imagery.
- Dilated convolution (DC5) preserves geometric detail in deeper layers, essential for retaining thin structures.
- Proven benchmark performance — R103-DC5 outperforms R101-DC5 with DeepLabV3+ on Cityscapes by +1.9 mIoU (78.1 → 80.0).
- Detectron2-pretrained weights are available for this configuration, giving faster convergence and a stronger starting point for transfer to satellite imagery.
Config Files and Hyperparameter Choice
In Detectron2, a training job is initialized from a YAML config file that specifies how the model and the training schedule are constructed. The following is the .yaml file we used to build and train our DeepLabV3+ model for our first successful New Jersey state-wide inference job.
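For reference, the sketch below shows how a config like the one in the next subsection is typically consumed when Detectron2 is installed with its DeepLab project. It is illustrative rather than our exact training entry point; in particular, the second registration call is a hypothetical stand-in for wherever the project registers the custom WeightedDeepLabHead and its CLASS_WEIGHT_* keys.

```python
from detectron2.config import get_cfg
from detectron2.modeling import build_model
from detectron2.projects.deeplab import add_deeplab_config

cfg = get_cfg()
add_deeplab_config(cfg)           # adds DeepLab-specific keys (ASPP, stem type, loss options)
# add_weighted_head_config(cfg)   # hypothetical: registers the custom head's extra keys
cfg.merge_from_file("configs/deeplab-v3-plus-resnet103.yaml")
model = build_model(cfg)          # builds the SemanticSegmentor named in MODEL.META_ARCHITECTURE
```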
Our Model’s Configuration YAML File
# configs/deeplab-v3-plus-resnet103.yaml
VERSION: 2
SEED: 42
MODEL:
META_ARCHITECTURE: "SemanticSegmentor"
DEVICE: "cuda"
WEIGHTS: "detectron2://DeepLab/R-103.pkl"
PIXEL_MEAN: [123.675, 116.280, 103.530]
PIXEL_STD: [58.395, 57.120, 57.375]
BACKBONE:
FREEZE_AT: 2
NAME: "build_resnet_deeplab_backbone"
RESNETS:
DEPTH: 101
NORM: "SyncBN"
OUT_FEATURES: ["res2", "res5"]
RES4_DILATION: 1
RES5_DILATION: 2
RES5_MULTI_GRID: [1, 2, 4]
STEM_TYPE: "deeplab"
STEM_OUT_CHANNELS: 128
STRIDE_IN_1X1: false
SEM_SEG_HEAD:
NAME: "WeightedDeepLabHead"
IN_FEATURES: ["res2", "res5"]
PROJECT_FEATURES: ["res2"]
PROJECT_CHANNELS: [48]
NORM: "SyncBN"
COMMON_STRIDE: 4
ASPP_CHANNELS: 256
ASPP_DILATIONS: [6, 12, 18]
ASPP_DROPOUT: 0.1
CONVS_DIM: 256
USE_DEPTHWISE_SEPARABLE_CONV: false
NUM_CLASSES: 2
IGNORE_VALUE: 255
LOSS_TYPE: "hard_pixel_mining"
TOP_K_PERCENT_PIXELS: 1.0
CLASS_WEIGHT_BACKGROUND: 1.0
CLASS_WEIGHT_FOREGROUND: 10.0
INPUT:
FORMAT: "RGB"
MASK_FORMAT: "bitmask"
RANDOM_FLIP: "horizontal"
MIN_SIZE_TRAIN: [256]
MAX_SIZE_TRAIN: 256
MIN_SIZE_TRAIN_SAMPLING: "choice"
CROP:
ENABLED: false
SOLVER:
IMS_PER_BATCH: 48
BASE_LR: 0.001
MAX_ITER: 100000
LR_SCHEDULER_NAME: "WarmupCosineLR"
WARMUP_FACTOR: 0.001
WARMUP_ITERS: 1000
WARMUP_METHOD: "linear"
CHECKPOINT_PERIOD: 200
MOMENTUM: 0.9
NESTEROV: False
BASE_LR_END: 0.0001
CLIP_GRADIENTS:
ENABLED: True
CLIP_TYPE: "norm"
CLIP_VALUE: 1.0
NORM_TYPE: 2.0
DATALOADER:
NUM_WORKERS: 1
ASPECT_RATIO_GROUPING: false
DATASETS:
TRAIN: ["stream_dummy"]
TEST: []
TEST:
EVAL_PERIOD: 2000
Explanation of Config Parameters
The chosen values were derived from iterative testing and manual adjustments during model development, rather than a formal hyperparameter optimization process. A future step for this pipeline will be systematic hyperparameter tuning in conjunction with training on a larger, higher-quality dataset.
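One set of values worth calling out explicitly is the class weighting (CLASS_WEIGHT_BACKGROUND: 1.0, CLASS_WEIGHT_FOREGROUND: 10.0) together with IGNORE_VALUE: 255. Because TOP_K_PERCENT_PIXELS is 1.0, every labeled pixel contributes to the loss, so the hard-pixel-mining loss effectively reduces to a class-weighted cross-entropy, sketched below for illustration (this is not the WeightedDeepLabHead implementation itself).

```python
import torch
import torch.nn.functional as F

# Class-weighted cross-entropy implied by the config: background 1.0, sidewalk 10.0.
class_weights = torch.tensor([1.0, 10.0])

def sidewalk_seg_loss(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """logits: [B, 2, H, W] raw class scores; targets: [B, H, W] labels with 255 = ignore."""
    return F.cross_entropy(logits, targets, weight=class_weights, ignore_index=255)
```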
We made significant modifications to the baseline Detectron2 implementation. The most notable change was developing a custom DeepLabV3+ architecture integration tailored for our data ingestion requirements. Our modified implementation supports loading datasets directly from Parquet files in a Hugging Face / Ray-Data compatible in-memory format, enabling faster, more efficient training while minimizing I/O bottlenecks.
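As a rough illustration of that ingestion path (not our actual loader), the sketch below streams rows from a Parquet shard with the Hugging Face datasets library and converts each row into the dict format Detectron2’s SemanticSegmentor consumes during training. The file path and the image/mask column names are placeholders for our real schema.

```python
import numpy as np
import torch
from datasets import load_dataset

# Stream rows lazily from a Parquet shard (path and column names are placeholders).
stream = load_dataset(
    "parquet", data_files={"train": ["tiles/part-000.parquet"]},
    split="train", streaming=True,
)

def to_detectron2_dict(row):
    image = np.asarray(row["image"], dtype=np.float32)      # H x W x 3, RGB
    mask = np.asarray(row["mask"], dtype=np.int64)          # H x W, values in {0, 1, 255}
    return {
        "image": torch.from_numpy(image).permute(2, 0, 1),  # C x H x W tensor
        "sem_seg": torch.from_numpy(mask),
        "height": image.shape[0],
        "width": image.shape[1],
    }

batch = [to_detectron2_dict(row) for _, row in zip(range(4), stream)]
```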
Model Architecture Review
This section reviews our deployed DeepLabV3+ (ResNet-101/103 style) semantic segmentor for sidewalk detection.
Overview Table
| Component | Details |
|---|---|
| Input | name: image, tensor: float32[batch_size, 3, height, width] |
| Output | name: sem_seg, tensor: float32[batch_size, 2, height, width] |
| Parameters | 59,408,266 |
Backbone (ResNet with DeepLab Stem)
| Stage | Block / Layer | Channels (in→out) | Kernel / Dilation | Stride | Norm | Freeze |
|---|---|---|---|---|---|---|
| Stem (frozen) | conv1 → conv2 → conv3 | 3→64→64→128 | 3×3 / 1 | 2,1,1 | SyncBN | Yes |
| res2 (frozen) | Bottleneck ×3 | 128→256 | (1×1, 3×3, 1×1) / 1 | 1 | SyncBN | Yes |
| res3 | Bottleneck ×4 | 256→512 | (1×1, 3×3, 1×1) / 1 | 2 (first) | SyncBN | No |
| res4 | Bottleneck ×23 | 512→1024 | (1×1, 3×3, 1×1) / 1 | 2 (first) | SyncBN | No |
| res5 | Bottleneck ×3 | 1024→2048 | (1×1, 3×3, 1×1) / 2,4,8 | 1 | SyncBN | No |
Decoder & Segmentation Head (DeepLabV3+)
| Component | Block / Layer | Channels (in→out) | Kernel / Dilation | Stride | Norm |
|---|---|---|---|---|---|
| ASPP (res5) | 5 parallel branches | 2048→(256×4) & 2048→256 | 1×1; 3×3 / 6,12,18; GAP→1×1 | 1 | SyncBN |
| Low-level proj (res2) | 1×1 projection | 256→48 | 1×1 / 1 | 1 | SyncBN |
| Fusion convs | 3×3 convs ×2 | 304→256→256 | 3×3 / 1 | 1 | SyncBN |
| Predictor | 1×1 classifier | 256→2 | 1×1 / 1 | 1 | — |
| Loss | Cross-Entropy (weighted) | — | — | — | — |
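The ASPP block summarized in the first row of this table can be sketched as follows. This minimal PyTorch version omits the normalization, activation, and dropout layers used in the actual head (SyncBN, ASPP_DROPOUT: 0.1) and is meant only to illustrate the five-branch layout and channel arithmetic.

```python
import torch
from torch import nn
from torch.nn import functional as F

class ASPPSketch(nn.Module):
    """One 1x1 conv, three 3x3 atrous convs (dilations 6/12/18), and global average pooling."""
    def __init__(self, in_channels: int = 2048, channels: int = 256, dilations=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, channels, 1, bias=False)]
            + [nn.Conv2d(in_channels, channels, 3, padding=d, dilation=d, bias=False)
               for d in dilations]
        )
        self.pool_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_channels, channels, 1, bias=False)
        )
        # 5 branches x 256 channels = 1280, projected back down to 256.
        self.project = nn.Conv2d(channels * (len(dilations) + 2), channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        outs = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.pool_branch(x), size=(h, w), mode="bilinear", align_corners=False)
        return self.project(torch.cat(outs + [pooled], dim=1))
```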
Key Takeaways
- Efficient by design: We keep output stride ≈16 using atrous convs in res5, so the encoder preserves detail without extra downsampling.
- Robust multi-scale context (ASPP): ASPP aggregates context via 1×1 and 3×3 (dilations 6,12,18) plus image pooling, supporting varying sidewalk widths and scene scales.
- Precise boundary handling (decoder): Low-level skip from res2 (256→48) preserves edges; fusing with ASPP sharpens thin, curvilinear structures.
- Clear Model Architecture: Enables clean training, ONNX export, and Triton serving without exotic ops or custom layers.
ONNX (Open Neural Network Exchange) is an open standard for representing machine learning models that provides portability across frameworks and inference deployment backends. Exporting to ONNX format allows us to take a model trained in PyTorch with Detectron2 and deploy it efficiently across different hardware and serving platforms.
Export Process Overview
We exported our SemanticSegmentor (DeepLabV3+ with WeightedDeepLabHead) to ONNX in three main steps:
- Load Model and Weights - Resolve the model configuration, instantiate the model, and load the final trained checkpoint.
- Wrap for Batched Inference - Define a custom BatchedWrapper class with a tensor-only forward signature (a minimal sketch follows this list).
- Export with torch.onnx.export - Trace with a dummy input, declaring dynamic axes for batch size, height, and width.
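A minimal version of such a wrapper might look like the sketch below. It is illustrative rather than our exact BatchedWrapper, and it assumes the custom head follows Detectron2’s convention of returning a (logits, losses) pair in inference mode. Because it calls the backbone and segmentation head directly, pixel normalization is not baked into the exported graph (see the note after the export call).

```python
import torch
from torch import nn

class BatchedWrapper(nn.Module):
    """Tensor-in / tensor-out forward signature so torch.onnx.export can trace the model."""
    def __init__(self, model):
        super().__init__()
        self.backbone = model.backbone
        self.sem_seg_head = model.sem_seg_head

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        features = self.backbone(image)          # dict of feature maps (res2, res5)
        logits, _ = self.sem_seg_head(features)  # [B, 2, H, W] scores at input resolution
        return logits

wrapped_model = BatchedWrapper(model).eval()  # `model` is the trained SemanticSegmentor
dummy_tensor = torch.randn(1, 3, 256, 256)    # dummy RGB input used only to trace the graph
```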
torch.onnx.export(
wrapped_model,
dummy_tensor,
"batched_semseg_model.onnx",
opset_version=16,
input_names=["image"],
output_names=["sem_seg"],
dynamic_axes={
"image": {0: "batch_size", 2: "height", 3: "width"},
"sem_seg": {0: "batch_size", 2: "height", 3: "width"},
}
)
The ONNX-exported model does not apply input normalization internally. For deployment, apply normalization (subtract PIXEL_MEAN, divide by PIXEL_STD) as a separate preprocessing step before passing inputs to the ONNX model.
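For example, deployment-side preprocessing and inference with ONNX Runtime might look like the sketch below; the mean and standard deviation match PIXEL_MEAN and PIXEL_STD from the training config, and inputs are assumed to be 0–255 RGB.

```python
import numpy as np
import onnxruntime as ort

PIXEL_MEAN = np.array([123.675, 116.280, 103.530], dtype=np.float32).reshape(1, 3, 1, 1)
PIXEL_STD = np.array([58.395, 57.120, 57.375], dtype=np.float32).reshape(1, 3, 1, 1)

session = ort.InferenceSession("batched_semseg_model.onnx", providers=["CPUExecutionProvider"])

def predict(batch_rgb: np.ndarray) -> np.ndarray:
    """batch_rgb: float32 [B, 3, H, W], RGB channel order, 0-255 pixel range."""
    normalized = (batch_rgb - PIXEL_MEAN) / PIXEL_STD          # same normalization as training
    (sem_seg,) = session.run(["sem_seg"], {"image": normalized})
    return sem_seg.argmax(axis=1)                              # per-pixel class ids (1 = sidewalk)
```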
Deployment Readiness
The exported ONNX model is now ready for use with:
- ONNX Runtime (CPU/GPU inference)
- Triton Inference Server (scalable deployment)
- TensorRT (optimized GPU inference)
Chapter Summary
The design of DeepLabV3+ allows it to excel in detecting sidewalks within high-resolution urban imagery by combining:
- Rich semantic features from a deep backbone to differentiate sidewalks from visually similar surfaces
- High-resolution feature preservation to maintain the narrow, elongated geometry typical of sidewalks
- Multi-scale context capture to identify sidewalks across diverse widths, textures, and environmental settings
- Refined boundary prediction through low-level feature fusion
This balance of global contextual consideration and fine, local feature preservation makes the model well-suited for accurate, scalable sidewalk segmentation across the state of New Jersey.