Mask R-CNN: A detailed guide with Detectron2:

Welcome to the Mask R-CNN with Detectron2 tutorial!

In this tutorial, we will jump into the workings of Mask R-CNN, a state-of-the-art framework for object instance segmentation.

We’ll be using the Detectron2 library, which provides efficient implementations of various object detection and segmentation algorithms. By the end of this tutorial, you’ll have a comprehensive understanding of how Mask R-CNN works and how to use Detectron2 to train and deploy your own instance segmentation models.

Let’s dive in!

Introducing Mask R-CNN:

Mask R-CNN is a framework that builds upon a series of developments in deep learning for computer vision to achieve state-of-the-art performance in instance segmentation tasks. Instance segmentation is the process of identifying and delineating each distinct object of interest in an image. Unlike simpler tasks such as classification (identifying what objects are present) or detection (locating objects), instance segmentation provides detailed information about the shape and exact boundaries of each object.

Foundation and Evolution:

To understand Mask R-CNN, it’s essential to trace its roots back through the evolution of related models:

R-CNN (Regions with Convolutional Neural Networks) laid the groundwork by using selective search to propose regions that might contain objects and then classifying each proposed region using CNN features.

Fast R-CNN improved efficiency by sharing computations across region proposals, using a technique called RoIPool (Region of Interest Pooling) to extract a fixed-size feature vector from each proposal.

Faster R-CNN introduced the Region Proposal Network (RPN), a fully convolutional network that predicts object bounds and objectness scores at each position. This innovation allowed the model to generate high-quality region proposals, which are then passed to a Fast R-CNN model for classification.
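To make "predicts object bounds and objectness scores at each position" more concrete: in the Faster R-CNN paper, the RPN scores a fixed set of reference boxes (anchors) at every position and regresses offsets from them. The sketch below only generates such anchors for one position. The function name, the scales, and the aspect ratios are illustrative assumptions, not values read from Detectron2's configuration.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchor boxes centred on (cx, cy).

    Each anchor has an area of roughly scale**2 and an aspect ratio
    (height / width) equal to ratio. Scales and ratios are assumed values.
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# Three scales times three ratios gives nine candidate boxes for this position.
anchors = anchors_at(400, 300)
print(len(anchors))
```

The RPN would then assign each of these boxes an objectness score and a set of box refinements; that learned part is not shown here.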

Understanding Mask R-CNN:

Mask R-CNN operates in two stages: the first involves generating proposals for objects within an image, and the second refines these proposals, classifying the objects and generating bounding boxes and segmentation masks. This architecture is built upon the success of Faster R-CNN but introduces a significant innovation with the mask branch, which works in parallel to the bounding box and classification branches.



One of the critical improvements Mask R-CNN introduces is the RoIAlign layer, which ensures that the extraction of features from each Region of Interest (RoI) is precisely aligned with the input, preserving the exact spatial locations. This precision is crucial for generating accurate masks and is a considerable improvement over the approximate spatial sampling in previous models. The approach of predicting a binary mask for each class independently, without competition among classes, is another key aspect that distinguishes Mask R-CNN.
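The difference between snapping a sampling point to a grid cell and keeping it at its fractional location can be shown on a tiny feature map. This is a minimal sketch of the bilinear interpolation idea behind RoIAlign, not Detectron2's implementation; the helper name and the 3x3 feature map are made up for illustration, and the sample point is assumed to lie inside the map.

```python
import math

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map (list of rows) at a fractional (y, x) location."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the neighbouring indices so points on the last row/column stay valid.
    y1 = min(y0 + 1, len(feature) - 1)
    x1 = min(x0 + 1, len(feature[0]) - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

# Hypothetical feature map whose value is 3*y + x, so the exact answer is easy to check.
feature = [
    [0.0, 1.0, 2.0],
    [3.0, 4.0, 5.0],
    [6.0, 7.0, 8.0],
]
y, x = 0.5, 1.5

quantized = feature[int(y)][int(x)]       # snap to a cell, as coordinate quantization does
aligned = bilinear_sample(feature, y, x)  # interpolate at the exact location

print(quantized, aligned)
```

The quantized lookup returns the value of a neighbouring cell, while the interpolated value matches the underlying function at the requested point. Small offsets of this kind matter little for classification but shift pixel-level masks.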

Overview of Detectron2:

Detectron2 is a software system that implements state-of-the-art object detection algorithms, including Mask R-CNN. Developed by Facebook AI Research (FAIR), Detectron2 offers a robust and flexible framework for computer vision tasks, enabling researchers and developers to build, train, and deploy object detection models quickly.

Key features of Detectron2 include a comprehensive model zoo that provides pre-trained models for a wide variety of tasks, a modular design that allows for easy experimentation with different model architectures, and support for the latest models and algorithms in object detection.

Detectron2’s design emphasizes flexibility, allowing users to adapt the framework to their specific needs while providing a powerful toolset for developing cutting-edge computer vision applications.

First, we need to install Detectron2 to make it available for use throughout the tutorial:

Each code snippet is marked with a cell number, and the accompanying text explains what the code in that cell does.

Environment and Library Setup

Cell #1 - Environment and Library Setup

Environment and Library Setup: First, we import the essential libraries and set up the logger. This includes cv2 (OpenCV) for image operations, numpy for numerical computations, and Detectron2-specific utilities such as model_zoo, get_cfg, DefaultTrainer, and DefaultPredictor. Calling setup_logger() ensures that we can see the output from Detectron2 operations.

Visualization and Data Handling: Next, we import Detectron2’s visualization tools (Visualizer, ColorMode) and data handling utilities (MetadataCatalog, DatasetCatalog, build_detection_test_loader, BoxMode). These are used to manage our dataset’s metadata, register our datasets for training and evaluation, and visualize predictions.

Evaluation Tools: To evaluate our model’s performance, we use COCOEvaluator and inference_on_dataset from Detectron2’s evaluation module. Additionally, display and Image from IPython are imported for inline image display within Jupyter notebooks.

#Cell 1 - Environment and Library Setup

import os
import json
import cv2
import numpy as np
import random
import requests
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, DefaultPredictor
from detectron2.utils.logger import setup_logger
from detectron2.utils.visualizer import Visualizer, ColorMode
from detectron2.data import MetadataCatalog, DatasetCatalog, build_detection_test_loader
from detectron2.structures import BoxMode
from detectron2.evaluation import COCOEvaluator, inference_on_dataset
from IPython.display import display, Image

setup_logger()  # enable Detectron2's log output

Working with Pre-trained Models:

Function to Display Images in Jupyter Notebooks

We start by defining a custom function, cv2_imshow_in_notebook, specifically designed for Jupyter notebooks. This function ensures that images are properly encoded and displayed inline.

# Cell 2 - Function to display images in Jupyter notebooks
def cv2_imshow_in_notebook(img):
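    # Encode the OpenCV (BGR) image as JPEG bytes, then show them inline
    _, img_encoded = cv2.imencode('.jpg', img)
    display(Image(data=img_encoded.tobytes()))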

Download a Sample Image From the COCO Dataset

We acquire a sample image from the COCO dataset. COCO (Common Objects in Context) is a large-scale dataset for object detection, segmentation, and captioning tasks. It contains over 330,000 images, with object instances labeled across 80 categories and five captions per image. The requests library is used to download the image, which is then loaded using OpenCV for processing.

#Cell 3 - Download a sample image from the COCO dataset
image_url = ""
image_path = "input.jpg"
r = requests.get(image_url)
with open(image_path, 'wb') as f:
    f.write(r.content)  # save the downloaded bytes to disk

# Load the image
im = cv2.imread(image_path)

Loading and Using Pre-trained Mask R-CNN Model

Initialize Configuration: We start with a default configuration and then load the settings for a Mask R-CNN model designed for the COCO dataset.

Set Detection Threshold: We adjust the model to only consider detections with a confidence score above 0.5 as valid.

Load Pre-trained Weights: The model is configured to use weights pre-trained on the COCO dataset, so it can detect and segment COCO's object categories without any further training.

Predictor Initialization: A DefaultPredictor is created with our settings, ready to process images.

Inference: We run inference on an image, producing predictions including object classes, bounding boxes, and segmentation masks.

#Cell 4 - Configure and load a pre-trained model from Detectron2 model zoo for demonstration
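cfg_demo = get_cfg()
cfg_demo.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))  # load the Mask R-CNN settings for COCO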
cfg_demo.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # set threshold for this model
cfg_demo.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
predictor_demo = DefaultPredictor(cfg_demo)
outputs_demo = predictor_demo(im)
[03/24 01:02:30 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from ...

Visualize the Prediction on the Sample Image and Display the Image With Predictions

Visualization Setup: Using Detectron2’s Visualizer, the code prepares to visualize the predictions. The image is adjusted for color display, and metadata for the dataset is provided for accurate annotation.

Drawing Predictions: It then draws the detected instances (objects and their segmentation masks) on the image. The predictions include details like bounding boxes and class labels.

Display in Notebook: Finally, the modified image, now annotated with predictions, is displayed directly in the Jupyter Notebook using a custom function that handles OpenCV image display compatibility.

#Cell 5 - Visualize the prediction on the sample image and display the image with predictions in Jupyter Notebook
v_demo = Visualizer(im[:, :, ::-1], MetadataCatalog.get(cfg_demo.DATASETS.TRAIN[0]), scale=1.2)
out_demo = v_demo.draw_instance_predictions(outputs_demo["instances"].to("cpu"))

cv2_imshow_in_notebook(out_demo.get_image()[:, :, ::-1])

Training Mask R-CNN on a Custom Dataset

Downloading the Dataset

The function uses os.system() to execute a command line instruction that downloads the balloon dataset zip file from a specified URL using wget. The dataset is sourced from the official GitHub release page of the Mask R-CNN project by Matterport.

Extracting the Dataset: Following the download, another os.system() call is made to unzip the downloaded file.

#Cell 6 -Downloading the dataset

def download_and_unzip_balloon_dataset():
    os.system("wget -O")
    os.system("unzip -o > /dev/null")

If your dataset is already in COCO format, you can skip the custom dataset preparation step and register it directly.

This is the code you would use:

Register the dataset:

from detectron2.data.datasets import register_coco_instances
register_coco_instances("my_dataset_train", {}, "path/to/train_annotation.json", "path/to/train_images/")
register_coco_instances("my_dataset_val", {}, "path/to/val_annotation.json", "path/to/val_images/")

Optionally, get metadata for visualization purposes:

my_dataset_metadata = MetadataCatalog.get("my_dataset_train")

Optionally, visualize some samples from the dataset:

dataset_dicts = DatasetCatalog.get("my_dataset_train")
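For reference, a COCO-format annotation file is a JSON document with three top-level lists: images, categories, and annotations. The sketch below builds a minimal one as a Python dictionary and runs a basic consistency check. Every file name, ID, and coordinate in it is a made-up placeholder, and check_coco is a hypothetical helper, not part of Detectron2.

```python
import json

# Minimal COCO-style structure with one image, one category, one annotation.
coco = {
    "images": [
        {"id": 1, "file_name": "0001.jpg", "width": 640, "height": 480},
    ],
    "categories": [
        {"id": 1, "name": "balloon"},
    ],
    "annotations": [
        {
            "id": 1,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100, 120, 50, 80],  # COCO boxes are [x, y, width, height]
            "segmentation": [[100, 120, 150, 120, 150, 200, 100, 200]],  # flat x, y list
            "area": 4000,
            "iscrowd": 0,
        },
    ],
}

def check_coco(data):
    """Check that every annotation points at a known image and category."""
    image_ids = {img["id"] for img in data["images"]}
    category_ids = {cat["id"] for cat in data["categories"]}
    for ann in data["annotations"]:
        assert ann["image_id"] in image_ids
        assert ann["category_id"] in category_ids
        assert len(ann["bbox"]) == 4
    return len(data["annotations"])

num_annotations = check_coco(coco)
serialized = json.dumps(coco)  # what would be written to the annotation .json file
print(num_annotations)
```

A file of this shape is what the annotation path passed to register_coco_instances is expected to contain.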

Preparing the Custom Dataset

Dataset Parsing: The get_balloon_dicts function reads dataset annotations from a JSON file (via_region_data.json). Each image and its annotations are loaded and processed.

Image Metadata Collection: For each image, metadata such as the file path, image ID, height, and width are collected. This information is essential for Detectron2 to understand how to process and display each image.

Annotation Processing: The annotations, which include points defining the polygons around each instance of the object of interest (balloons, in this case), are converted into a format compatible with Detectron2. This includes creating bounding boxes and specifying the object category.

Dataset Registration: The DatasetCatalog.register function is used to register the dataset for training and validation purposes. This step makes the dataset recognizable by Detectron2, allowing it to be used in training and evaluation workflows.

Metadata Setting: The MetadataCatalog.get(...).set(...) call assigns metadata, such as class names (["balloon"]), to the dataset. This metadata is useful for understanding the predictions and is used in visualization.
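The annotation processing step can be tried in isolation on a single polygon. The following is a minimal sketch that mirrors the conversion described above using plain Python; the helper names are hypothetical, the string "XYXY_ABS" stands in for BoxMode.XYXY_ABS so the example runs without Detectron2, and the triangle coordinates are invented.

```python
def polygon_to_annotation(px, py, category_id=0):
    """Convert polygon x/y point lists into a Detectron2-style annotation dict."""
    # Shift each vertex to the pixel centre and flatten to [x0, y0, x1, y1, ...].
    poly = [coord for x, y in zip(px, py) for coord in (x + 0.5, y + 0.5)]
    return {
        "bbox": [min(px), min(py), max(px), max(py)],  # tightest box around the polygon
        "bbox_mode": "XYXY_ABS",  # placeholder for BoxMode.XYXY_ABS
        "segmentation": [poly],
        "category_id": category_id,
    }

def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] to the [x, y, width, height] layout used by COCO."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

# Hypothetical triangle-shaped region.
annotation = polygon_to_annotation(px=[10, 30, 20], py=[5, 5, 25])
print(annotation["bbox"])
print(xyxy_to_xywh(annotation["bbox"]))
```

Declaring the box mode matters because the same four numbers mean different rectangles under the two layouts.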

#Cell 7 - Preparing the custom dataset

import os
from detectron2.data import DatasetCatalog, MetadataCatalog

def download_and_unzip_balloon_dataset():
    os.system("wget -O")
    os.system("unzip -o > /dev/null")

def get_balloon_dicts(img_dir):
    json_file = os.path.join(img_dir, "via_region_data.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)

    dataset_dicts = []
    for idx, v in enumerate(imgs_anns.values()):
        record = {}
        filename = os.path.join(img_dir, v["filename"])
        height, width = cv2.imread(filename).shape[:2]
        record["file_name"] = filename
        record["image_id"] = idx
        record["height"] = height
        record["width"] = width
        annos = v["regions"]
        objs = []
        for _, anno in annos.items():
            anno = anno["shape_attributes"]
            px = anno["all_points_x"]
            py = anno["all_points_y"]
            poly = [(x + 0.5, y + 0.5) for x, y in zip(px, py)]
            poly = [p for x in poly for p in x]

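            obj = {
                "bbox": [np.min(px), np.min(py), np.max(px), np.max(py)],
                "bbox_mode": BoxMode.XYXY_ABS,
                "segmentation": [poly],
                "category_id": 0,
            }
            objs.append(obj)
        # Attach all instances to this image and add the image to the dataset
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts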

# Download and prepare the balloon dataset
for d in ["train", "val"]:
    DatasetCatalog.register("balloon_" + d, lambda d=d: get_balloon_dicts(f"balloon/{d}"))
    MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"])

Model Configuration

Model Architecture: This involves choosing a specific model architecture suitable for the task at hand. We are using Mask R-CNN. The architecture is defined by a configuration file or parameters that describe the model layers, sizes, and types.

Pre-trained Weights: Utilizing pre-trained weights from a model zoo accelerates training and improves performance, especially when data is limited. These weights serve as a starting point, and Detectron2 offers access to a wide range of models pre-trained on large datasets like COCO.

Hyperparameters: Setting hyperparameters such as the learning rate, batch size, number of iterations, etc., is crucial. These parameters significantly impact the training process’s efficiency and the final model’s accuracy.

Thresholds for Detection: Defining thresholds, such as the minimum score for detecting objects, helps in filtering out less confident predictions, thus refining the results.

#Cell 8 - Configuration for the model

from detectron2.config import get_cfg
from detectron2 import model_zoo

cfg = get_cfg()  # Initialize the configuration with Detectron2's default settings.
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))  # Load and merge a configuration file from the model zoo.
cfg.DATASETS.TRAIN = ("balloon_train",)  # Training dataset identifier.
cfg.DATASETS.TEST = ("balloon_val",)  # Validation dataset identifier. Both identifiers must match datasets registered earlier.
cfg.DATALOADER.NUM_WORKERS = 2  # Number of worker processes for loading data; affects loading speed and training throughput.
# Initialize from pre-trained weights (transfer learning): a model trained on a large,
# general dataset (COCO) is fine-tuned for a specific task (balloon instance segmentation).
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 2  # Number of images per training batch.
cfg.SOLVER.BASE_LR = 0.00025  # Base learning rate.
cfg.SOLVER.MAX_ITER = 300  # Maximum number of training iterations.
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128  # Number of region proposals sampled per image to train the ROI heads.
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # Number of classes to detect; here there is only one (balloon).
cfg.OUTPUT_DIR = './output'  # Directory where trained model checkpoints and performance metrics are saved.

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)  # Create the output directory if it does not already exist.

Initialize the Trainer and Start Training

Training runs for a fixed number of iterations (cfg.SOLVER.MAX_ITER), each processing one batch, and optimizes the model’s weights to minimize the loss function. The trainer draws its batches from the training dataset named in the configuration. It evaluates on the validation set during training only if periodic evaluation has been configured; in this tutorial, evaluation is run as a separate step after training.
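Because Detectron2 counts iterations rather than epochs, it can help to translate one into the other. The arithmetic below is a rough sketch: the iteration count and batch size come from the configuration cell, while the training-set size of 61 images is an assumption about the balloon dataset that is not stated in this tutorial, and iterations_to_epochs is a hypothetical helper.

```python
def iterations_to_epochs(max_iter, ims_per_batch, num_images):
    """Approximate number of passes over the dataset for a given iteration budget."""
    return max_iter * ims_per_batch / num_images

# MAX_ITER and IMS_PER_BATCH as set in the configuration; num_images is an assumed value.
epochs = iterations_to_epochs(max_iter=300, ims_per_batch=2, num_images=61)
print(round(epochs, 1))
```

Under that assumption the run amounts to roughly ten passes over the training images, which is short and relies heavily on the pre-trained weights.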

DefaultTrainer: This is a class provided by Detectron2 designed to simplify the training process of models. It includes the complex training logic needed for computer vision tasks, such as instance segmentation, object detection, etc. The DefaultTrainer class is initialized with a configuration object (cfg) that includes all the settings and parameters for the model, the data, and the training process itself.

trainer = DefaultTrainer(cfg): This line creates an instance of the DefaultTrainer class with the specified configuration. The cfg object passed to DefaultTrainer contains all the necessary settings, including the model architecture, dataset information, hyperparameters, and other training options. This step effectively prepares the training environment with all the specified configurations.

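trainer.resume_or_load(resume=False): This method is called on the trainer object before training starts. With resume=True, it would continue from the last checkpoint found in cfg.OUTPUT_DIR, if one exists. With resume=False, as here, it loads the weights given in cfg.MODEL.WEIGHTS (the COCO pre-trained checkpoint) and starts training from iteration 0.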

trainer.train(): This line initiates the training process. Once this method is called, the DefaultTrainer begins training the model using the data, model architecture, and hyperparameters defined in the cfg object.
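The lr column in the training log printed by this cell rises steadily instead of sitting at BASE_LR. The numbers are consistent with a linear warmup that starts at 0.001 times BASE_LR and spans all 300 iterations. That reading is inferred from the logged values, not stated in the tutorial, and the function below is a sketch of such a schedule with assumed parameter names and defaults, not Detectron2's scheduler code.

```python
def warmup_lr(iteration, base_lr=0.00025, warmup_factor=0.001, warmup_iters=300):
    """Linear warmup: scale base_lr from warmup_factor up to 1 over warmup_iters."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    return base_lr * (warmup_factor * (1 - alpha) + alpha)

# Iterations taken from the first, second, and last lines of the training log.
for it in (19, 39, 299):
    print(it, warmup_lr(it))
```

Comparing the printed values with the lr entries at the same iterations in the log is a quick way to check whether the assumed schedule matches a given run.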

#Cell 9 - Initialize the trainer and start training

from detectron2.engine import DefaultTrainer
import json
import cv2
import numpy as np
from detectron2.structures import BoxMode

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)  # start from the weights in cfg.MODEL.WEIGHTS
trainer.train()
[03/24 01:10:10]:  eta: 0:04:40  iter: 19  total_loss: 2.065  loss_cls: 0.6864  loss_box_reg: 0.548  loss_mask: 0.6894  loss_rpn_cls: 0.04892  loss_rpn_loc: 0.006012    time: 0.9925  last_time: 0.8960  data_time: 0.0228  last_data_time: 0.0034   lr: 1.6068e-05  max_mem: 2738M
[03/24 01:10:30]:  eta: 0:04:19  iter: 39  total_loss: 1.878  loss_cls: 0.5938  loss_box_reg: 0.6185  loss_mask: 0.6034  loss_rpn_cls: 0.02946  loss_rpn_loc: 0.005873    time: 0.9980  last_time: 1.1743  data_time: 0.0075  last_data_time: 0.0095   lr: 3.2718e-05  max_mem: 2738M
[03/24 01:10:51]:  eta: 0:03:59  iter: 59  total_loss: 1.611  loss_cls: 0.4638  loss_box_reg: 0.6267  loss_mask: 0.4823  loss_rpn_cls: 0.03451  loss_rpn_loc: 0.003271    time: 1.0144  last_time: 1.1730  data_time: 0.0073  last_data_time: 0.0038   lr: 4.9367e-05  max_mem: 2750M
[03/24 01:11:12]:  eta: 0:03:40  iter: 79  total_loss: 1.489  loss_cls: 0.368  loss_box_reg: 0.6775  loss_mask: 0.3813  loss_rpn_cls: 0.01245  loss_rpn_loc: 0.007338    time: 1.0203  last_time: 0.8949  data_time: 0.0084  last_data_time: 0.0125   lr: 6.6017e-05  max_mem: 2750M
[03/24 01:11:34]:  eta: 0:03:22  iter: 99  total_loss: 1.22  loss_cls: 0.2809  loss_box_reg: 0.6749  loss_mask: 0.2885  loss_rpn_cls: 0.02219  loss_rpn_loc: 0.007728    time: 1.0321  last_time: 0.9512  data_time: 0.0146  last_data_time: 0.0164   lr: 8.2668e-05  max_mem: 2750M
[03/24 01:11:56]:  eta: 0:03:04  iter: 119  total_loss: 1.126  loss_cls: 0.2496  loss_box_reg: 0.5879  loss_mask: 0.2272  loss_rpn_cls: 0.02268  loss_rpn_loc: 0.009278    time: 1.0457  last_time: 0.9841  data_time: 0.0126  last_data_time: 0.0182   lr: 9.9318e-05  max_mem: 2841M
[03/24 01:12:17]:  eta: 0:02:45  iter: 139  total_loss: 1.088  loss_cls: 0.2045  loss_box_reg: 0.6462  loss_mask: 0.1918  loss_rpn_cls: 0.03709  loss_rpn_loc: 0.01279    time: 1.0502  last_time: 1.1140  data_time: 0.0120  last_data_time: 0.0078   lr: 0.00011597  max_mem: 2841M
[03/24 01:12:38]:  eta: 0:02:24  iter: 159  total_loss: 0.7912  loss_cls: 0.1295  loss_box_reg: 0.4954  loss_mask: 0.111  loss_rpn_cls: 0.007541  loss_rpn_loc: 0.004359    time: 1.0478  last_time: 1.0345  data_time: 0.0104  last_data_time: 0.0046   lr: 0.00013262  max_mem: 2842M
[03/24 01:12:59]:  eta: 0:02:04  iter: 179  total_loss: 0.7232  loss_cls: 0.1283  loss_box_reg: 0.4466  loss_mask: 0.146  loss_rpn_cls: 0.02093  loss_rpn_loc: 0.01081    time: 1.0468  last_time: 1.0894  data_time: 0.0089  last_data_time: 0.0076   lr: 0.00014927  max_mem: 2842M
[03/24 01:13:21]:  eta: 0:01:43  iter: 199  total_loss: 0.5111  loss_cls: 0.1041  loss_box_reg: 0.3098  loss_mask: 0.0884  loss_rpn_cls: 0.01099  loss_rpn_loc: 0.006714    time: 1.0518  last_time: 1.0038  data_time: 0.0094  last_data_time: 0.0053   lr: 0.00016592  max_mem: 2842M
[03/24 01:13:41]:  eta: 0:01:22  iter: 219  total_loss: 0.4182  loss_cls: 0.07609  loss_box_reg: 0.2185  loss_mask: 0.0854  loss_rpn_cls: 0.01286  loss_rpn_loc: 0.006843    time: 1.0490  last_time: 1.0151  data_time: 0.0088  last_data_time: 0.0046   lr: 0.00018257  max_mem: 2842M
[03/24 01:14:03]:  eta: 0:01:02  iter: 239  total_loss: 0.3773  loss_cls: 0.07422  loss_box_reg: 0.1823  loss_mask: 0.08169  loss_rpn_cls: 0.00807  loss_rpn_loc: 0.006176    time: 1.0522  last_time: 0.9854  data_time: 0.0100  last_data_time: 0.0097   lr: 0.00019922  max_mem: 2842M
[03/24 01:14:24]:  eta: 0:00:41  iter: 259  total_loss: 0.3635  loss_cls: 0.07033  loss_box_reg: 0.1717  loss_mask: 0.07367  loss_rpn_cls: 0.01738  loss_rpn_loc: 0.005297    time: 1.0520  last_time: 1.1337  data_time: 0.0092  last_data_time: 0.0117   lr: 0.00021587  max_mem: 2842M
[03/24 01:14:45]:  eta: 0:00:20  iter: 279  total_loss: 0.3828  loss_cls: 0.08961  loss_box_reg: 0.1605  loss_mask: 0.1044  loss_rpn_cls: 0.007527  loss_rpn_loc: 0.008889    time: 1.0528  last_time: 1.0041  data_time: 0.0094  last_data_time: 0.0155   lr: 0.00023252  max_mem: 2842M
[03/24 01:15:10]:  eta: 0:00:00  iter: 299  total_loss: 0.3096  loss_cls: 0.07114  loss_box_reg: 0.1505  loss_mask: 0.07772  loss_rpn_cls: 0.009302  loss_rpn_loc: 0.006894    time: 1.0543  last_time: 1.1920  data_time: 0.0115  last_data_time: 0.0112   lr: 0.00024917  max_mem: 2842M
[03/24 01:15:11 d2.engine.hooks]: Overall training speed: 298 iterations in 0:05:14 (1.0543 s / it)
[03/24 01:15:11 d2.engine.hooks]: Total training time: 0:05:18 (0:00:04 on hooks)
[03/24 01:15:11]: Distribution of instances among all 1 categories:
|  category  | #instances   |
|  balloon   | 50           |
|            |              |
[03/24 01:15:11]: [DatasetMapper] Augmentations used in inference: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')]
[03/24 01:15:11]: Serializing the dataset using: <class ''>
[03/24 01:15:11]: Serializing 13 elements to byte tensors and concatenating them all ...
[03/24 01:15:11]: Serialized dataset takes 0.04 MiB

The evaluation results indicate the model’s performance on the validation dataset. The average precision (AP) scores at different Intersection over Union (IoU) thresholds and object sizes are reported.

Here’s a summary of the results:

Bounding Box AP: The average precision for bounding box detection ranges from 78.99% to 91.66% across different IoU thresholds and object sizes. This indicates the accuracy of object localization.

Segmentation AP: The average precision for instance segmentation ranges from 81.74% to 95.99%, indicating the accuracy of both object localization and pixel-wise segmentation.
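Since every AP figure depends on an IoU threshold, it is worth seeing how IoU is computed for a pair of boxes. The sketch below uses boxes in (x1, y1, x2, y2) form and counts how many of the COCO-style thresholds (0.50 to 0.95 in steps of 0.05) a single prediction would satisfy. The boxes are invented and box_iou is an illustrative helper, not the evaluator's code.

```python
def box_iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical prediction and ground-truth boxes, offset by two pixels horizontally.
pred = (0, 0, 10, 10)
gt = (2, 0, 12, 10)
iou = box_iou(pred, gt)

# IoU thresholds from 0.50 to 0.95 in steps of 0.05.
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
passed = [t for t in thresholds if iou >= t]
print(iou, passed)
```

A prediction counts as correct at a given threshold only if its IoU with a ground-truth object reaches that threshold, which is why AP drops as the threshold tightens. For segmentation AP the same ratio is computed over mask pixels instead of box areas.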

Code Explanation:

inference_on_dataset(trainer.model, val_loader, evaluator): This function performs the actual evaluation. It takes the trained model (trainer.model), the validation data loader (val_loader), and the COCOEvaluator instance (evaluator). The function passes the validation dataset through the trained model to make predictions, then compares these predictions against the true labels to compute the evaluation metrics specified by the COCOEvaluator (e.g., precision, recall, mAP).

COCOEvaluator("balloon_val", cfg, False, output_dir=cfg.OUTPUT_DIR): This creates an instance of the COCOEvaluator, which is a class provided by Detectron2 for evaluating model performance using COCO metrics.

build_detection_test_loader(cfg, "balloon_val"): This function prepares the data loader for the validation dataset. It uses the configuration (cfg) to understand how to process the data and "balloon_val" to specify which dataset to load.

#Cell 10 - Evaluation

from detectron2.evaluation import COCOEvaluator
from detectron2.data import build_detection_test_loader
from detectron2.evaluation import inference_on_dataset

evaluator = COCOEvaluator("balloon_val", cfg, False, output_dir=cfg.OUTPUT_DIR)
val_loader = build_detection_test_loader(cfg, "balloon_val")
inference_on_dataset(trainer.model, val_loader, evaluator)

Setup Cfg for Inference

The inference section prepares the trained model for making predictions on new images by loading the trained weights and setting up inference parameters like the score threshold. After this setup, you can use the predictor object to perform inference on unseen images.

cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth"): This line specifies the path to the weights of the trained model, telling Detectron2 where to find the trained model so it can be loaded for making predictions.

cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7: Here, the threshold for considering a detection to be valid during testing is set to 0.7. This means that any detection with a confidence score lower than 0.7 will be ignored. Adjusting this threshold can help balance between precision and recall, depending on the requirements of your specific application. A higher threshold generally leads to higher precision but lower recall, as fewer detections are considered confident enough.

predictor = DefaultPredictor(cfg): An instance of DefaultPredictor is created with the updated configuration. DefaultPredictor is a class provided by Detectron2 that simplifies making predictions with a trained model. It takes care of model loading and setting up the data pipeline for feeding images into the model.
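The precision and recall trade-off behind SCORE_THRESH_TEST can be illustrated with a handful of scored detections. The data below is entirely hypothetical (six detections, five ground-truth objects), and precision_recall is an illustrative helper, not part of Detectron2.

```python
def precision_recall(detections, num_ground_truth, threshold):
    """Compute precision and recall after discarding detections below threshold.

    detections is a list of (score, is_true_positive) pairs.
    """
    kept = [is_tp for score, is_tp in detections if score >= threshold]
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / num_ground_truth
    return precision, recall

# Hypothetical detections: (confidence score, whether it matches a real object).
detections = [
    (0.95, True), (0.85, True), (0.72, True),
    (0.65, False), (0.60, True), (0.55, False),
]
num_ground_truth = 5

low = precision_recall(detections, num_ground_truth, threshold=0.5)
high = precision_recall(detections, num_ground_truth, threshold=0.7)
print(low, high)
```

On this toy data, raising the threshold from 0.5 to 0.7 removes both false positives but also drops one correct detection, so precision goes up while recall goes down.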

#Cell 11 - Setup cfg for inference
from detectron2.engine import DefaultPredictor

cfg.MODEL.WEIGHTS = os.path.join(cfg.OUTPUT_DIR, "model_final.pth") # path to the model we just trained
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7   # set a custom testing threshold
predictor = DefaultPredictor(cfg)
[03/24 01:21:50 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from ./output/model_final.pth ...

Visualizing Model Predictions on Validation Data

Visualize the Prediction: A Visualizer instance is created with the image and metadata for the validation dataset. The instance_mode=ColorMode.IMAGE_BW argument ensures that unsegmented pixels are not colored, focusing attention on the segmented objects.

The visualized output is temporarily saved to a file, which is then displayed in the Jupyter Notebook using display(Image(filename=tmpfile.name)).

outputs = predictor(im) uses the previously configured predictor to detect objects in the loaded image, producing predictions that include object classes, bounding boxes, and masks.

After displaying, the temporary file is deleted with os.unlink(tmpfile.name) to clean up and free resources.
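The save, use, and delete pattern for the temporary file works independently of OpenCV and can be tried with placeholder bytes. This is a minimal sketch using only the standard library; the byte string stands in for real image data.

```python
import os
import tempfile

# delete=False keeps the file on disk after the with-block closes it,
# so another tool can still open it by name.
with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmpfile:
    tmpfile.write(b"placeholder image bytes")
    tmp_path = tmpfile.name

existed_after_close = os.path.exists(tmp_path)

# The caller is responsible for removing the file once it is no longer needed.
os.unlink(tmp_path)
removed = not os.path.exists(tmp_path)
print(existed_after_close, removed)
```

With delete=False nothing removes the file automatically, which is why the explicit os.unlink call is needed.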

#Cell 12 - Visualizing model predictions on validation data

import random
from detectron2.utils.visualizer import Visualizer
from detectron2.utils.visualizer import ColorMode

# Load a random image from the validation set
dataset_dicts = get_balloon_dicts("balloon/val")
d = random.choice(dataset_dicts)
im = cv2.imread(d["file_name"])
outputs = predictor(im)

# Visualize the prediction
v = Visualizer(im[:, :, ::-1], metadata=MetadataCatalog.get("balloon_val"), scale= 0.5, instance_mode=ColorMode.IMAGE_BW)  # remove the colors of unsegmented pixels
out = v.draw_instance_predictions(outputs["instances"].to("cpu"))

# Display the image - adapt for Jupyter Notebook or standalone Python script as needed
from IPython.display import display, Image
import tempfile

# Create a temporary file to save the visualized output
with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmpfile:
    cv2.imwrite(tmpfile.name, out.get_image()[:, :, ::-1])
    # Display in Jupyter Notebook
    display(Image(filename=tmpfile.name))

# Cleanup the tempfile if necessary
os.unlink(tmpfile.name)


Congratulations on completing the Mask R-CNN with Detectron2 tutorial!

We’ve covered a lot of content, from understanding the theoretical underpinnings of Mask R-CNN to implementing it using the Detectron2 library.

I hope this tutorial has provided you with a solid foundation and equipped you with the knowledge and tools to explore further on your own.
