In the previous chapters we treated the perception subsystem from first principles: from the foundations of supervised learning to the deep learning architectures used in computer vision. We now synthesize these algorithms into pipelines that can decompose a scene into objects. As discussed in the CNN introduction, humans have a unique ability to interpret scenes because they can infer (reason about) what they do not see, which is why scene understanding involves far more than perception alone. In this chapter we cover algorithms that allow us to detect and segment objects in the scene.
Detect objects in an image
Object detection is demonstrated in this short video clip, which shows the end result: bounding boxes placed around instances of the classes of interest.
The difference between classification and object detection is shown below.
Difference between classification and detection
In classification we are given images (a video clip can be treated as a sequence of images) and are asked to produce the array of labels assigned to the objects present in the frame. In many datasets there is only one class per image and the images are cropped around the object. In localization, in addition to classifying, we are interested in locating each class in the frame, for example with a bounding box. In object detection we localize multiple objects, some of which may belong to the same class. Localization is fundamentally a regression problem (although its implementation may move far away from a regression setting). Mathematically we have,
y=p(x)
We try to find a function approximation to the true mapping p from the image x to the location of the bounding box y. The bounding box can be represented uniquely by the (x, y) coordinates of its upper-left corner together with its width and height, [x, y, w, h]. Since y is a floating-point vector, this is a regression problem and we can use a regression loss such as MSE, where the error is the Euclidean distance between the coordinates of the true bounding box and those of the estimated bounding box.
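The localization loss described above can be sketched in a few lines. The function name and the pixel values below are illustrative, not from any particular library: the box is just a 4-vector [x, y, w, h] and the loss is the Euclidean distance between prediction and ground truth.

```python
import math

def bbox_l2_loss(pred, target):
    """Euclidean (L2) distance between a predicted and a true box,
    both given as floating-point [x, y, w, h] vectors in pixels."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)))

# A hypothetical prediction that is off by 3 px in x and 4 px in width:
loss = bbox_l2_loss(pred=[103.0, 50.0, 24.0, 40.0],
                    target=[100.0, 50.0, 20.0, 40.0])
print(loss)  # 5.0
```

Minimizing this distance over a training set is exactly the regression setting of the equation above.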
However, the regression approach does not work well in practice and has been superseded by the algorithms described later in this chapter.
Work on object detection spans more than 20 years, and it is impossible to cover every algorithmic approach here; the interested reader can trace these developments in this survey.
Object detection roadmap
Since 2014, deep learning has surpassed classical ML in the detection competitions. We therefore focus only on deep architectures — more specifically on two-stage detectors that employ two key ingredients:
- Region proposals.
- Fully Convolutional Networks (FCNs).
Object detection involves three main stages: feature extraction, classification, and localization. In the literature the feature-extraction and classification stages are counted as one, which is why such architectures are referred to as two-stage.
An additional requirement is the ability to detect objects in near real time (e.g., 20 frames per second), which a significant subset of mission-critical applications demands. We therefore focus on region-based detectors as the canonical CNN architecture for detection, and also cover single-stage detectors (the YOLO family) that achieve higher throughput at some cost in accuracy.
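Both detector families produce many overlapping candidate boxes for the same object and rely on non-maximum suppression (NMS) to keep only the best one. The sketch below is a plain-Python illustration of greedy NMS, using the [x, y, w, h] box format from earlier; the boxes, scores, and the 0.5 overlap threshold are illustrative values, not from any specific detector.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as [x, y, w, h]."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, then drop every
    remaining box that overlaps it by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [[10, 10, 20, 20], [12, 12, 20, 20], [100, 100, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too much and is dropped
```

Production detectors implement the same idea with vectorized tensor operations, but the logic is identical.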
Before we continue, you can try out a live demo using your webcam in the browser.
Semantic segmentation
Semantic segmentation in medical, robotic, and sports analytics applications
Both detection and segmentation enable the reflexive part of perception, where inference reduces to a classification, regression, or search problem. Depending on the algorithm, inference can range from a few milliseconds to hundreds of milliseconds. Both are essential parts of many mission-critical, near-real-time applications such as robotics and self-driving cars.
There are other abilities needed for scene understanding that we cover later in this book. Our ability to recognize the attribute of uniqueness in an object and assign a symbol to it is fundamental to reasoning at the symbolic level. At that level we can use a whole portfolio of symbolic inference algorithms developed over the last few decades. But before we reach this level we need to solve the supervised learning problem for the relatively narrow task of bounding and coloring objects. This needs annotated data, and knowing what kind of data we have at our disposal is an essential skill.
Instance segmentation
Instance segmentation vs semantic segmentation
This is an even more complex problem than semantic segmentation, as we must additionally color the different instances of the same class differently.
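The distinction can be made concrete with two toy label maps. The tiny grids below are hand-made illustrations: the semantic map stores one class id per pixel (so two touching objects of the same class merge into one blob), while the instance map gives each object its own id.

```python
# Semantic map: 0 = background, 1 = "person". Two adjacent people
# are indistinguishable — they share the class id.
semantic = [
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
]
# Instance map for the same pixels: ids 1 and 2 separate the two people.
instance = [
    [0, 1, 1, 2, 0],
    [0, 1, 2, 2, 0],
]

num_person_pixels = sum(c == 1 for row in semantic for c in row)
num_instances = len({c for row in instance for c in row} - {0})
print(num_person_pixels, num_instances)  # 6 2
```

Panoptic segmentation, mentioned below in the COCO tasks, effectively combines both maps into one labeling.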
Datasets for computer vision tasks
COCO
Typical example for detection, semantic segmentation, and image captioning tasks
After its publication by Microsoft, the COCO dataset became the reference dataset for training models on perception tasks, and it keeps evolving through yearly competitions. These competitions are more challenging than earlier ones (e.g. VOC) because many of the objects are small. COCO’s 330K images are annotated with:
- 80 object classes. These are the so-called thing classes (person, car, elephant, …).
- 91 stuff classes. These are the so-called stuff classes (sky, grass, wall, …). Stuff classes cover the majority of the pixels in COCO (~66%). They are important because they allow us to explain key aspects of an image: the scene type, which thing classes are likely to be present and where (through contextual reasoning), physical attributes, material types, and geometric properties of the scene.
- 5 captions per image
- Keypoints for the “person” class
Common perception tasks that the dataset can be used for include:
- Detection Task: Object detection and semantic segmentation of thing classes.
- Stuff Segmentation Task: Semantic segmentation of stuff classes.
- Keypoints Task: Localization of a person’s keypoints (sparse skeletal points).
- DensePose Task: Localization of people’s dense keypoints, mapping all human pixels to a 3D surface of the human body.
- Panoptic Segmentation Task: Scene segmentation, unifying semantic and instance segmentation tasks. Task is across thing and stuff classes.
- Image Captioning Task: Describing the image with natural-language text. This task ended in 2015, but image captioning remains very important, and other datasets exist to supplement the curated COCO captions.
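To make the annotation structure concrete, here is a minimal, hand-made dictionary in the COCO style: boxes are stored as [x, y, width, height] and annotations link to images and categories by id. The file name, ids, and box values are invented for illustration; real COCO annotations are JSON files with the same top-level keys.

```python
# Hand-made, COCO-style annotation dict (illustrative values).
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"image_id": 1, "category_id": 18, "bbox": [100.0, 120.0, 80.0, 60.0]},
        {"image_id": 1, "category_id": 1, "bbox": [300.0, 50.0, 60.0, 200.0]},
    ],
}

id_to_name = {c["id"]: c["name"] for c in coco["categories"]}

def boxes_for_image(coco, image_id):
    """Return (class_name, [x, y, w, h]) pairs for one image."""
    return [(id_to_name[a["category_id"]], a["bbox"])
            for a in coco["annotations"] if a["image_id"] == image_id]

print(boxes_for_image(coco, 1))
# [('dog', [100.0, 120.0, 80.0, 60.0]), ('person', [300.0, 50.0, 60.0, 200.0])]
```

In practice one would load the real JSON with the `pycocotools` library rather than iterating by hand, but the underlying schema is the one shown here.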
Even in a world with so much data, the curated available datasets that can be used to train models are by no means enough to solve AI problems in any domain.
Firstly, datasets are geared towards competitions that supposedly advance the science, but in many instances leaderboards become “academic exercises” where a 0.1% mean-accuracy improvement can win the competition yet does little to progress AI. Double-digit improvements, on the other hand, can, and such discoveries create clusters of implementations and publications around them that fine-tune them. One of these discoveries is the R-CNN architecture, which advanced the accuracy metric by almost 30%.
Secondly, the scene-understanding problems that AI engineers face in the field, e.g. in industrial automation or drug discovery, involve domain-specific classes of objects. Although curated datasets cannot be used directly, engineers can apply transfer learning, where a dataset is used to train a model for a given task and the resulting weights are reused to train a model for a fairly similar task.
Key references: (Lin et al., 2014; Zhou et al., 2014; Cordts et al., 2016; Redmon et al., 2015; Ren et al., 2015; Redmon & Farhadi, 2016)
References
- Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., et al. (2016). The Cityscapes Dataset for Semantic Urban Scene Understanding.
- Lin, T., Maire, M., Belongie, S., Bourdev, L., Girshick, R., et al. (2014). Microsoft COCO: Common Objects in Context.
- Redmon, J., Divvala, S., Girshick, R., Farhadi, A. (2015). You only look once: Unified, real-time object detection.
- Redmon, J., Farhadi, A. (2016). YOLO9000: Better, Faster, Stronger.
- Ren, S., He, K., Girshick, R., Sun, J. (2015). Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.
- Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A. (2014). Object detectors emerge in deep scene CNNs.