The segmentation approach described in this section is Mask R-CNN (He et al., 2017). Mask R-CNN is an extension of Faster R-CNN that adds a mask head to the detector. The mask head is a small CNN that takes per-region features, pooled from the backbone feature map using the bounding box coordinates of each detected object, and outputs a mask for that object: a binary image in which the pixels belonging to the object are marked as 1 and the rest as 0.

In the object detection section we saw that R-CNN simply cropped proposals, generated externally to the detector, from the input image and classified each crop. Since the proposals typically overlapped, the per-proposal CNN feature extraction was largely redundant and the detector was very slow. Fast R-CNN improved on this by passing the whole input image once through a CNN feature extractor and pooling the features of each proposal from the resulting shared feature map, thereby avoiding feature extraction per proposal. Faster R-CNN then removed the external dependency on proposal generation by introducing a Region Proposal Network (RPN) inside the detector. For the RPN to generate proposals, prior (or anchor) boxes were defined uniformly across the input image and the RPN was trained to predict, for each anchor, an objectness score and the offsets by which the anchor must shift and scale to match the ground-truth bounding box (see the sketch below). The code in this section, together with the visualizations, is useful for understanding both Faster R-CNN and the mask head extension that 'colors' the pixels of the detected objects.
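To make the anchor-regression step concrete, here is a minimal sketch, in plain NumPy with a hypothetical function name, of how predicted offsets are applied to an anchor box using the standard Faster R-CNN box parameterization.

```python
import numpy as np

def decode_anchor(anchor_cxcywh, deltas):
    """Apply RPN-predicted offsets (tx, ty, tw, th) to one anchor box.

    Uses the standard Faster R-CNN parameterization:
        x = tx * wa + xa,   y = ty * ha + ya,
        w = wa * exp(tw),   h = ha * exp(th)
    Boxes are in (center_x, center_y, width, height) format.
    """
    xa, ya, wa, ha = anchor_cxcywh
    tx, ty, tw, th = deltas
    x = tx * wa + xa
    y = ty * ha + ya
    w = wa * np.exp(tw)
    h = ha * np.exp(th)
    return np.array([x, y, w, h])

# Example: an anchor centered at (100, 100) with size 64x64 is shifted
# left/up slightly and widened to better fit a ground-truth box.
print(decode_anchor(np.array([100.0, 100.0, 64.0, 64.0]),
                    np.array([0.1, -0.2, 0.3, 0.0])))
```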

Demo

Notebooks

TensorFlow

The four notebooks in this section use Mask R-CNN and are based on Matterport's original implementation; as such they will not work in TF2. For newer versions see the TF Model Garden or the TPU-optimized repo.

PyTorch

There are two main PyTorch implementations of Mask R-CNN: the Detectron2 library, which is oriented towards research projects and offers more flexibility at the cost of a steeper learning curve, and the model shipped as part of the torchvision library, which is simpler to use at the expense of configurability. Key references: (Ren et al., 2015; He et al., 2017; Chen et al., 2018; Peng et al., 2017)
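As a minimal sketch of how the torchvision model can be used for inference, assuming a recent torchvision version (older releases expect pretrained=True instead of the weights argument):

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn

# Load a Mask R-CNN with a ResNet-50 FPN backbone pretrained on COCO.
model = maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

# Dummy input: a list of 3xHxW float tensors with values in [0, 1].
images = [torch.rand(3, 480, 640)]

with torch.no_grad():
    outputs = model(images)

# One dict per image with 'boxes' (N x 4), 'labels' (N), 'scores' (N)
# and 'masks' (N x 1 x H x W soft masks that can be thresholded, e.g. at 0.5,
# to obtain the binary per-object masks described above).
pred = outputs[0]
print(pred["boxes"].shape, pred["masks"].shape)
```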

References

  • Chen, L., Zhu, Y., Papandreou, G., Schroff, F., Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation.
  • He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask R-CNN.
  • Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., et al. (2017). MegDet: A Large Mini-Batch Object Detector.
  • Ren, S., He, K., Girshick, R., Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.