We have selected Mask R-CNN as the model on which we base our final deliverable, owing to its excellent instance segmentation performance after the various optimizations implemented in the Detectron2 library. The model is also well understood, as it is based on the family of region-based object detectors (Faster R-CNN). Although our segmentation task is not expected to benefit much from the multi-scale detection of Feature Pyramid Networks (FPN), since the bird takes pictures from a constant elevation in space and objects therefore appear at a roughly constant scale, we have nevertheless selected the general model configuration based on the results of Detectron2's evaluation on the COCO dataset; selected results relevant to our application are shown below.

Available Backbone Configurations

  • FPN: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction, respectively. It obtains the best speed/accuracy tradeoff, but the other two are still useful for research.
  • C4: Use a ResNet conv4 backbone with conv5 head. The original baseline in the Faster R-CNN paper.
  • DC5 (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads for mask and box prediction, respectively. This is used by the Deformable ConvNet paper.
Most models are trained with the 3x schedule (~37 COCO epochs). Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (~12 COCO epochs) training schedule for comparison when doing quick research iteration.
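As a rough illustration, the sketch below shows how one of these backbone configurations can be loaded through Detectron2's model zoo config system. The YAML name corresponds to the R50-FPN 3x entry in the baseline table further down; the C4 and DC5 variants would be selected by swapping in the matching config file.

```python
# Minimal sketch (assumes Detectron2 is installed with its model zoo configs available):
# load the R50-FPN 3x Mask R-CNN configuration and inspect a few of its settings.
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
print(cfg.MODEL.BACKBONE.NAME)  # backbone builder, e.g. "build_resnet_fpn_backbone"
print(cfg.SOLVER.MAX_ITER)      # iteration count of the 3x schedule
```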

Available Pretrained Models (ImageNet)

It’s common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:
  • R-50.pkl: converted copy of MSRA’s original ResNet-50 model.
  • R-101.pkl: converted copy of MSRA’s original ResNet-101 model.
  • X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB.
Note that the above models have a different format from those provided in Detectron: we do not fuse BatchNorm into an affine layer.
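A minimal sketch of how a Detectron2 config is pointed at one of these ImageNet-pretrained backbones follows; the `detectron2://` path mirrors the default of the base R-CNN FPN config as we understand it, and would be adjusted for the R-101 or X-101 weights.

```python
# Minimal sketch: initialize the backbone from one of the converted ImageNet models.
from detectron2.config import get_cfg

cfg = get_cfg()
# The "detectron2://" prefix resolves to the weights hosted by FAIR; this entry
# corresponds to the converted R-50.pkl listed above.
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
cfg.MODEL.RESNETS.DEPTH = 50  # use 101 together with R-101.pkl for the deeper backbone
```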

Detectron2 Performance Results

COCO Instance Segmentation Baselines with Mask R-CNN

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP |
|---|---|---|---|---|---|---|
| R50-C4 | 1x | 0.584 | 0.110 | 5.2 | 36.8 | 32.2 |
| R50-DC5 | 1x | 0.471 | 0.076 | 6.5 | 38.3 | 34.2 |
| R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 |
| R50-C4 | 3x | 0.575 | 0.111 | 5.2 | 39.8 | 34.4 |
| R50-DC5 | 3x | 0.470 | 0.076 | 6.5 | 40.0 | 35.9 |
| R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 |
| R101-C4 | 3x | 0.652 | 0.145 | 6.3 | 42.6 | 36.7 |
| R101-DC5 | 3x | 0.545 | 0.092 | 7.6 | 41.9 | 37.3 |
| R101-FPN | 3x | 0.340 | 0.056 | 4.6 | 42.9 | 38.6 |
| X101-FPN | 3x | 0.690 | 0.103 | 7.2 | 44.3 | 39.5 |

New Baselines Using Large-Scale Jitter and Longer Training Schedule

The following baselines of COCO Instance Segmentation with Mask R-CNN are generated using a longer training schedule and large-scale jitter as described in Google’s Simple Copy-Paste Data Augmentation paper. These models are trained from scratch using random initialization. These baselines exceed the previous Mask R-CNN baselines.
| Name | epochs | train time (s/im) | inference time (s/im) | box AP | mask AP |
|---|---|---|---|---|---|
| R50-FPN | 100 | 0.376 | 0.069 | 44.6 | 40.3 |
| R50-FPN | 200 | 0.376 | 0.069 | 46.3 | 41.7 |
| R50-FPN | 400 | 0.376 | 0.069 | 47.4 | 42.5 |
| R101-FPN | 100 | 0.518 | 0.073 | 46.4 | 41.6 |
| R101-FPN | 200 | 0.518 | 0.073 | 48.0 | 43.1 |
| R101-FPN | 400 | 0.518 | 0.073 | 48.9 | 43.7 |
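The large-scale jitter described above boils down to a simple augmentation recipe. The sketch below is a rough approximation using transforms we believe Detectron2 provides (`ResizeScale`, `FixedSizeCrop`); the 0.1–2.0 scale range and the 1024×1024 canvas follow the Copy-Paste paper's recipe, and the exact values used by the published baselines may differ.

```python
# Minimal sketch of large-scale jitter (LSJ): randomly rescale each image by a
# factor in [0.1, 2.0], then crop or pad to a fixed 1024x1024 canvas.
import detectron2.data.transforms as T

IMAGE_SIZE = 1024  # assumed LSJ training resolution

lsj_augmentations = [
    T.ResizeScale(
        min_scale=0.1, max_scale=2.0,
        target_height=IMAGE_SIZE, target_width=IMAGE_SIZE,
    ),
    T.FixedSizeCrop(crop_size=(IMAGE_SIZE, IMAGE_SIZE)),
    T.RandomFlip(horizontal=True),
]
```

In a training setup, a list like this would be handed to the dataset mapper that builds the training data loader.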

Selected Model Configuration

Based on the results above, we have selected the R101-FPN model as the baseline for this project. With the standard 3x schedule (~37 COCO epochs) it reaches a mask AP of 38.6, while with large-scale jitter augmentation and a much longer schedule (400 epochs) it reaches a mask AP of 43.7.
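A minimal sketch of how this baseline could be instantiated for our project is shown below. The config and checkpoint names are taken from the Detectron2 model zoo as we understand it; the number of classes and the dataset names are hypothetical placeholders for our own data and are not values from this report.

```python
# Minimal sketch: build the selected R101-FPN Mask R-CNN baseline from the
# Detectron2 model zoo and prepare it for fine-tuning on our own dataset.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

CONFIG = "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(CONFIG))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG)  # COCO-pretrained weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3          # placeholder: our number of classes
cfg.DATASETS.TRAIN = ("our_dataset_train",)  # placeholder: must be registered beforehand
cfg.DATASETS.TEST = ("our_dataset_val",)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
# trainer.train()  # uncomment to fine-tune once the datasets are registered
```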