We have selected Mask R-CNN as the model on which we base our final deliverable, owing to its excellent instance segmentation performance after the various optimizations implemented in the Detectron2 library. The model is also well understood, as it is based on the family of region-based object detectors (Faster R-CNN). Although our segmentation task is not expected to benefit much from the multi-scale detection of Feature Pyramid Networks (FPN), since the bird takes pictures from a constant elevation in space and objects therefore appear at a roughly constant scale, we have nevertheless selected the general model configuration based on the results of Detectron2's evaluation on the COCO dataset; selected results relevant to our application are shown below.

Available Backbone Configurations

  • FPN: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction, respectively. It obtains the best speed/accuracy tradeoff, but the other two are still useful for research.
  • C4: Use a ResNet conv4 backbone with conv5 head. The original baseline in the Faster R-CNN paper.
  • DC5 (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads for mask and box prediction, respectively. This is used by the Deformable ConvNet paper.
Most models are trained with the 3x schedule (~37 COCO epochs). Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (~12 COCO epochs) training schedule for comparison when doing quick research iteration.
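As a rough illustration, the sketch below shows how one of these backbone configurations can be loaded through Detectron2's model zoo config system. The YAML name corresponds to the R50-FPN 3x entry in the baseline table further down; the C4 and DC5 variants would be selected by swapping in the matching config file.

```python
# Minimal sketch (assumes Detectron2 is installed with its model zoo configs available):
# load the R50-FPN 3x Mask R-CNN configuration and inspect a few of its settings.
from detectron2 import model_zoo
from detectron2.config import get_cfg

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
)
print(cfg.MODEL.BACKBONE.NAME)  # backbone builder, e.g. "build_resnet_fpn_backbone"
print(cfg.SOLVER.MAX_ITER)      # iteration count of the 3x schedule
```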

Available Pretrained Models (ImageNet)

It’s common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:
  • R-50.pkl: converted copy of MSRA’s original ResNet-50 model.
  • R-101.pkl: converted copy of MSRA’s original ResNet-101 model.
  • X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB.
Note that the above models have a different format from those provided in Detectron: we do not fuse BatchNorm into an affine layer.
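A minimal sketch of how a Detectron2 config is pointed at one of these ImageNet-pretrained backbones follows; the `detectron2://` path mirrors the default of the base R-CNN FPN config as we understand it, and would be adjusted for the R-101 or X-101 weights.

```python
# Minimal sketch: initialize the backbone from one of the converted ImageNet models.
from detectron2.config import get_cfg

cfg = get_cfg()
# The "detectron2://" prefix resolves to the weights hosted by FAIR; this entry
# corresponds to the converted R-50.pkl listed above.
cfg.MODEL.WEIGHTS = "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
cfg.MODEL.RESNETS.DEPTH = 50  # use 101 together with R-101.pkl for the deeper backbone
```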

Detectron2 Performance Results

COCO Instance Segmentation Baselines with Mask R-CNN

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP |
|---|---|---|---|---|---|---|
| R50-C4 | 1x | 0.584 | 0.110 | 5.2 | 36.8 | 32.2 |
| R50-DC5 | 1x | 0.471 | 0.076 | 6.5 | 38.3 | 34.2 |
| R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 |
| R50-C4 | 3x | 0.575 | 0.111 | 5.2 | 39.8 | 34.4 |
| R50-DC5 | 3x | 0.470 | 0.076 | 6.5 | 40.0 | 35.9 |
| R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 |
| R101-C4 | 3x | 0.652 | 0.145 | 6.3 | 42.6 | 36.7 |
| R101-DC5 | 3x | 0.545 | 0.092 | 7.6 | 41.9 | 37.3 |
| R101-FPN | 3x | 0.340 | 0.056 | 4.6 | 42.9 | 38.6 |
| X101-FPN | 3x | 0.690 | 0.103 | 7.2 | 44.3 | 39.5 |

New Baselines Using Large-Scale Jitter and Longer Training Schedule

The following baselines of COCO Instance Segmentation with Mask R-CNN are generated using a longer training schedule and large-scale jitter as described in Google’s Simple Copy-Paste Data Augmentation paper. These models are trained from scratch using random initialization. These baselines exceed the previous Mask R-CNN baselines.
| Name | epochs | train time (s/im) | inference time (s/im) | box AP | mask AP |
|---|---|---|---|---|---|
| R50-FPN | 100 | 0.376 | 0.069 | 44.6 | 40.3 |
| R50-FPN | 200 | 0.376 | 0.069 | 46.3 | 41.7 |
| R50-FPN | 400 | 0.376 | 0.069 | 47.4 | 42.5 |
| R101-FPN | 100 | 0.518 | 0.073 | 46.4 | 41.6 |
| R101-FPN | 200 | 0.518 | 0.073 | 48.0 | 43.1 |
| R101-FPN | 400 | 0.518 | 0.073 | 48.9 | 43.7 |
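The large-scale jitter described above boils down to a simple augmentation recipe. The sketch below is a rough approximation using transforms we believe Detectron2 provides (`ResizeScale`, `FixedSizeCrop`); the 0.1–2.0 scale range and the 1024×1024 canvas follow the Copy-Paste paper's recipe, and the exact values used by the published baselines may differ.

```python
# Minimal sketch of large-scale jitter (LSJ): randomly rescale each image by a
# factor in [0.1, 2.0], then crop or pad to a fixed 1024x1024 canvas.
import detectron2.data.transforms as T

IMAGE_SIZE = 1024  # assumed LSJ training resolution

lsj_augmentations = [
    T.ResizeScale(
        min_scale=0.1, max_scale=2.0,
        target_height=IMAGE_SIZE, target_width=IMAGE_SIZE,
    ),
    T.FixedSizeCrop(crop_size=(IMAGE_SIZE, IMAGE_SIZE)),
    T.RandomFlip(horizontal=True),
]
```

In a training setup, a list like this would be handed to the dataset mapper that builds the training data loader.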

Selected Model Configuration

Based on the results above, we have selected the R101-FPN model as the baseline for this project. With the standard 3x schedule (~37 COCO epochs) it reaches a mask AP of 38.6, while with large-scale jitter augmentation and a much longer schedule (400 epochs) it reaches a mask AP of 43.7.
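A minimal sketch of how this baseline could be instantiated for our project is shown below. The config and checkpoint names are taken from the Detectron2 model zoo as we understand it; the number of classes and the dataset names are hypothetical placeholders for our own data and are not values from this report.

```python
# Minimal sketch: build the selected R101-FPN Mask R-CNN baseline from the
# Detectron2 model zoo and prepare it for fine-tuning on our own dataset.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

CONFIG = "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(CONFIG))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(CONFIG)  # COCO-pretrained weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 3          # placeholder: our number of classes
cfg.DATASETS.TRAIN = ("our_dataset_train",)  # placeholder: must be registered beforehand
cfg.DATASETS.TEST = ("our_dataset_val",)

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
# trainer.train()  # uncomment to fine-tune once the datasets are registered
```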