Available Backbone Configurations
- FPN: Use a ResNet+FPN backbone with standard conv and FC heads for mask and box prediction, respectively. It obtains the best speed/accuracy tradeoff, but the other two are still useful for research.
- C4: Use a ResNet conv4 backbone with conv5 head. The original baseline in the Faster R-CNN paper.
- DC5 (Dilated-C5): Use a ResNet conv5 backbone with dilations in conv5, and standard conv and FC heads for mask and box prediction, respectively. This is used by the Deformable ConvNet paper.
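As a rough illustration of how these three backbone variants are selected in practice, the sketch below builds the relative config path for a Mask R-CNN baseline. The helper name `mask_rcnn_config_path` and the `AVAILABLE` set are hypothetical and introduced only for this example. The set is limited to the backbone/schedule combinations listed in the COCO Instance Segmentation baseline table in this document, and the path pattern assumes the naming used by the Detectron2 model zoo config directory.

```python
# Backbone/variant/schedule combinations that appear in the COCO Instance
# Segmentation baseline table of this document (hypothetical helper data).
AVAILABLE = {
    ("R_50", "C4", "1x"), ("R_50", "DC5", "1x"), ("R_50", "FPN", "1x"),
    ("R_50", "C4", "3x"), ("R_50", "DC5", "3x"), ("R_50", "FPN", "3x"),
    ("R_101", "C4", "3x"), ("R_101", "DC5", "3x"), ("R_101", "FPN", "3x"),
    ("X_101_32x8d", "FPN", "3x"),
}


def mask_rcnn_config_path(backbone: str, variant: str, schedule: str = "3x") -> str:
    """Return the relative config path for a Mask R-CNN baseline.

    Assumes the model zoo naming pattern
    COCO-InstanceSegmentation/mask_rcnn_<backbone>_<variant>_<schedule>.yaml.
    """
    if (backbone, variant, schedule) not in AVAILABLE:
        raise ValueError(
            f"No baseline listed for {backbone}-{variant} with the {schedule} schedule"
        )
    return f"COCO-InstanceSegmentation/mask_rcnn_{backbone}_{variant}_{schedule}.yaml"


if __name__ == "__main__":
    print(mask_rcnn_config_path("R_101", "FPN"))
```

If Detectron2 is installed, a string of this form is what `detectron2.model_zoo.get_config_file` expects as its argument; that step is left out here so the sketch runs without the library.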
Available Pretrained Models (ImageNet)
It’s common to initialize from backbone models pre-trained on ImageNet classification tasks. The following backbone models are available:
- R-50.pkl: converted copy of MSRA’s original ResNet-50 model.
- R-101.pkl: converted copy of MSRA’s original ResNet-101 model.
- X-101-32x8d.pkl: ResNeXt-101-32x8d model trained with Caffe2 at FB.
Detectron2 Performance Results
COCO Instance Segmentation Baselines with Mask R-CNN
| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP |
|---|---|---|---|---|---|---|
| R50-C4 | 1x | 0.584 | 0.110 | 5.2 | 36.8 | 32.2 |
| R50-DC5 | 1x | 0.471 | 0.076 | 6.5 | 38.3 | 34.2 |
| R50-FPN | 1x | 0.261 | 0.043 | 3.4 | 38.6 | 35.2 |
| R50-C4 | 3x | 0.575 | 0.111 | 5.2 | 39.8 | 34.4 |
| R50-DC5 | 3x | 0.470 | 0.076 | 6.5 | 40.0 | 35.9 |
| R50-FPN | 3x | 0.261 | 0.043 | 3.4 | 41.0 | 37.2 |
| R101-C4 | 3x | 0.652 | 0.145 | 6.3 | 42.6 | 36.7 |
| R101-DC5 | 3x | 0.545 | 0.092 | 7.6 | 41.9 | 37.3 |
| R101-FPN | 3x | 0.340 | 0.056 | 4.6 | 42.9 | 38.6 |
| X101-FPN | 3x | 0.690 | 0.103 | 7.2 | 44.3 | 39.5 |
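One way to read the speed/accuracy tradeoff out of this table is to keep only the models that no other model beats on both inference time and mask AP at once (a Pareto front). The sketch below does this with the numbers copied from the table above. The names `Baseline`, `dominates`, and `pareto_front` are illustrative and not part of Detectron2.

```python
from typing import List, NamedTuple


class Baseline(NamedTuple):
    name: str
    schedule: str
    inference_s_per_im: float  # inference time (s/im), lower is better
    mask_ap: float             # mask AP, higher is better


# Values copied from the Mask R-CNN baseline table above.
BASELINES = [
    Baseline("R50-C4", "1x", 0.110, 32.2),
    Baseline("R50-DC5", "1x", 0.076, 34.2),
    Baseline("R50-FPN", "1x", 0.043, 35.2),
    Baseline("R50-C4", "3x", 0.111, 34.4),
    Baseline("R50-DC5", "3x", 0.076, 35.9),
    Baseline("R50-FPN", "3x", 0.043, 37.2),
    Baseline("R101-C4", "3x", 0.145, 36.7),
    Baseline("R101-DC5", "3x", 0.092, 37.3),
    Baseline("R101-FPN", "3x", 0.056, 38.6),
    Baseline("X101-FPN", "3x", 0.103, 39.5),
]


def dominates(a: Baseline, b: Baseline) -> bool:
    """True if a is at least as fast and as accurate as b, and strictly better in one."""
    no_worse = a.inference_s_per_im <= b.inference_s_per_im and a.mask_ap >= b.mask_ap
    better = a.inference_s_per_im < b.inference_s_per_im or a.mask_ap > b.mask_ap
    return no_worse and better


def pareto_front(rows: List[Baseline]) -> List[Baseline]:
    """Return the rows not dominated by any other row, fastest first."""
    front = [r for r in rows if not any(dominates(o, r) for o in rows)]
    return sorted(front, key=lambda r: r.inference_s_per_im)


if __name__ == "__main__":
    for row in pareto_front(BASELINES):
        print(row.name, row.schedule, row.inference_s_per_im, row.mask_ap)
```

With these numbers, only FPN models with the 3x schedule remain on the front, which is consistent with the statement that FPN offers the best speed/accuracy tradeoff. Training time and memory are ignored in this comparison.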
New Baselines Using Large-Scale Jitter and Longer Training Schedule
The following baselines of COCO Instance Segmentation with Mask R-CNN are generated using a longer training schedule and large-scale jitter as described in Google’s Simple Copy-Paste Data Augmentation paper. These models are trained from scratch using random initialization. These baselines exceed the previous Mask R-CNN baselines.
| Name | epochs | train time (s/im) | inference time (s/im) | box AP | mask AP |
|---|---|---|---|---|---|
| R50-FPN | 100 | 0.376 | 0.069 | 44.6 | 40.3 |
| R50-FPN | 200 | 0.376 | 0.069 | 46.3 | 41.7 |
| R50-FPN | 400 | 0.376 | 0.069 | 47.4 | 42.5 |
| R101-FPN | 100 | 0.518 | 0.073 | 46.4 | 41.6 |
| R101-FPN | 200 | 0.518 | 0.073 | 48.0 | 43.1 |
| R101-FPN | 400 | 0.518 | 0.073 | 48.9 | 43.7 |
Selected Model Configuration
Based on the results above, we have selected R101-FPN as the baseline model for this project. With the 3x schedule (approximately 40 COCO epochs), it reaches a mask AP of 38.6. With large-scale jitter augmentation and a much longer schedule of 400 epochs, it reaches a mask AP of 43.7.
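The epoch figure quoted for the standard schedule can be checked with a short calculation. The sketch below converts an "Nx" learning-rate schedule into COCO epochs. It rests on assumptions that are not stated in this document: a 1x schedule of 90,000 iterations, 16 images per batch, and 118,287 images in COCO train2017. The function name `schedule_to_epochs` is hypothetical.

```python
# Assumptions (not stated in this document): Detectron2's 1x schedule runs
# 90,000 iterations at 16 images per batch, and COCO train2017 has 118,287 images.
ITERS_PER_1X = 90_000
IMAGES_PER_BATCH = 16
COCO_TRAIN2017_IMAGES = 118_287


def schedule_to_epochs(multiplier: int) -> float:
    """Convert an 'Nx' schedule multiplier into approximate COCO epochs."""
    images_seen = multiplier * ITERS_PER_1X * IMAGES_PER_BATCH
    return images_seen / COCO_TRAIN2017_IMAGES


if __name__ == "__main__":
    for m in (1, 3):
        print(f"{m}x schedule: about {schedule_to_epochs(m):.1f} COCO epochs")
```

Under these assumptions the 3x schedule comes out between 36 and 37 epochs, which the text rounds to about 40. The 400-epoch large-scale-jitter run is therefore roughly an order of magnitude longer.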

