This blog post introduces object detection (& instance segmentation) in computer vision.

Object detection is a computer vision task where we aim to locate and classify any number of objects within an image. To localize objects, we use bounding boxes (bboxes), each defined by the box's center coordinate, width, and height. Object detection has higher real-world relevance than simpler computer vision tasks because we often need to locate and classify multiple objects to fully understand an environment (e.g., detecting trash and disposing of it properly, or detecting humans and objects on the road).
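To make the representation concrete, here is a minimal Python sketch converting between the (center, width, height) parameterization and the corner format (x1, y1, x2, y2) that many libraries use; the function names are illustrative, not from any particular library.

```python
def center_to_corners(cx, cy, w, h):
    """Convert a (center x, center y, width, height) box to corner format."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def corners_to_center(x1, y1, x2, y2):
    """Convert (x1, y1, x2, y2) corners back to (center, size) format."""
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)

print(center_to_corners(50, 50, 20, 10))  # (40.0, 45.0, 60.0, 55.0)
```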
This relevance comes at a cost: object detection is significantly more complex than simpler tasks and requires more sophisticated solutions. We don't know the number of objects in an image beforehand, which makes it difficult to determine the appropriate output size, and models must distinguish between instances of the same class and handle objects of various sizes that may overlap. In this article, we will introduce traditional approaches to object detection.
R-CNN & Fast R-CNN
A straightforward approach would be to slide windows of various sizes over the image and use a CNN to predict whether each window contains an object of a specific class. However, this is extremely inefficient, requiring countless iterations over different window sizes and positions. Instead, we can use a heuristic function to generate approximately 2000 region proposals. Then, we can resize these regions, run a CNN on each to predict the class and bounding box, and select the proposals with the highest confidence. This approach is called region-based CNN (R-CNN).
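As a rough illustration of this pipeline (not the original implementation), the sketch below uses a random-box stand-in for the proposal heuristic and takes the classifier as a hypothetical cnn callable; the point is that the network runs once per proposal, which is exactly what makes R-CNN slow.

```python
import numpy as np

def propose_regions(image, n=2000, seed=0):
    """Hypothetical stand-in for the heuristic (e.g. selective search):
    returns n candidate boxes as (x1, y1, x2, y2) tuples."""
    h, w = image.shape[:2]
    rng = np.random.default_rng(seed)
    boxes = []
    for _ in range(n):
        x1, y1 = rng.integers(0, w - 32), rng.integers(0, h - 32)
        bw, bh = rng.integers(32, w - x1 + 1), rng.integers(32, h - y1 + 1)
        boxes.append((int(x1), int(y1), int(x1 + bw), int(y1 + bh)))
    return boxes

def resize(img, size):
    """Nearest-neighbor resize; enough for a sketch."""
    ys = np.linspace(0, img.shape[0] - 1, size[0]).astype(int)
    xs = np.linspace(0, img.shape[1] - 1, size[1]).astype(int)
    return img[ys][:, xs]

def rcnn_detect(image, cnn, input_size=224, min_confidence=0.5):
    """Run the classifier once per resized proposal (the slow part of R-CNN)."""
    detections = []
    for (x1, y1, x2, y2) in propose_regions(image):
        crop = resize(image[y1:y2, x1:x2], (input_size, input_size))
        class_id, confidence = cnn(crop)  # cnn is a placeholder classifier
        detections.append((class_id, confidence, (x1, y1, x2, y2)))
    return [d for d in detections if d[1] > min_confidence]
```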

The details of the heuristic function will not be covered here, as better methods exist that don't rely on it. We can observe that running a CNN on each region is computationally expensive. Instead, we can first perform feature extraction using a backbone (leveraging transfer learning to fine-tune the feature extractor of a pre-trained model), project the region proposals onto the resulting feature map, and perform class and bounding box prediction with a smaller head. This approach is called Fast R-CNN and is significantly faster for both training and inference.
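A minimal sketch of this idea using PyTorch, assuming a ResNet-18 backbone (the choice is arbitrary) and torchvision's roi_pool, which projects each proposal onto the shared feature map and pools it to a fixed size:

```python
import torch
import torch.nn as nn
import torchvision

# Build a feature extractor by dropping ResNet-18's pooling and fc layers.
backbone = torchvision.models.resnet18(weights=None)
backbone = nn.Sequential(*list(backbone.children())[:-2])

image = torch.randn(1, 3, 512, 512)
features = backbone(image)  # (1, 512, 16, 16): the backbone has stride 32

# One made-up proposal in image coordinates; the first column is the
# batch index, the rest is (x1, y1, x2, y2).
proposals = torch.tensor([[0.0, 32.0, 64.0, 256.0, 256.0]])

# roi_pool scales the box by 1/32 to project it onto the feature map,
# then pools the region to a fixed 7x7 grid for the prediction head.
pooled = torchvision.ops.roi_pool(
    features, proposals, output_size=(7, 7), spatial_scale=1 / 32
)
print(pooled.shape)  # torch.Size([1, 512, 7, 7])
```

Because the expensive backbone runs once per image rather than once per proposal, both training and inference get dramatically cheaper.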
Faster R-CNN & Mask R-CNN
Although Fast R-CNN is much faster than the original R-CNN, it spends over 90% of its inference time generating region proposals with the heuristic function on a CPU. Furthermore, the heuristic function is not learnable, so it consistently produces a large number of poor proposals alongside the good ones. Faster R-CNN addresses these issues by introducing a region proposal network (RPN) that generates region proposals from the feature map, making it the first end-to-end model for object detection.

The RPN is a small CNN that, for each pixel of the feature map, produces adjustments to predefined anchor boxes (3 sizes and 3 aspect ratios) and classifies each box as object or background. For example, for a feature map of size (32, 32, 128), the RPN outputs bounding box predictions of shape (32, 32, 36), corresponding to 9 boxes * 4 adjustment values (x', y', w', h'), and classifications of shape (32, 32, 18), corresponding to 9 boxes * 2 classes (object, background). The RPN is trained to adjust anchor boxes and classify them as objects when they have an IoU greater than 0.7 with a ground-truth box, and to classify boxes with an IoU less than 0.3 as background. The region proposals (anchor boxes that were adjusted and classified as objects) are then cropped from the feature map, resized to a fixed size, and passed to small heads for bounding box and class prediction.
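Here is a minimal PyTorch sketch of such an RPN head, assuming a 128-channel feature map as above; note that PyTorch is channel-first, so the outputs appear as (1, 36, 32, 32) and (1, 18, 32, 32) rather than (32, 32, 36) and (32, 32, 18):

```python
import torch
import torch.nn as nn

NUM_ANCHORS = 9  # 3 sizes x 3 aspect ratios

class RPNHead(nn.Module):
    def __init__(self, in_channels=128, hidden=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1)
        # 4 adjustment values (x', y', w', h') per anchor
        self.bbox = nn.Conv2d(hidden, NUM_ANCHORS * 4, kernel_size=1)
        # 2 logits (object vs. background) per anchor
        self.cls = nn.Conv2d(hidden, NUM_ANCHORS * 2, kernel_size=1)

    def forward(self, feature_map):
        x = torch.relu(self.conv(feature_map))
        return self.bbox(x), self.cls(x)

features = torch.randn(1, 128, 32, 32)
bbox_pred, cls_pred = RPNHead()(features)
print(bbox_pred.shape, cls_pred.shape)  # (1, 36, 32, 32), (1, 18, 32, 32)
```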
Faster R-CNN performs better and is significantly faster than Fast R-CNN, thanks to the trainable region proposal algorithm and the end-to-end architecture. However, the method used to resize region proposals to a fixed size on the feature map, ROI pooling, was not very effective. Mask R-CNN therefore replaced ROI pooling with ROI align, which is generally more effective. (The details of ROI pooling and ROI align are beyond the scope of this discussion, as we will see better techniques that don't involve them.) In addition to ROI align, Mask R-CNN adds masking, or pixel-wise classification, in the heads to perform instance segmentation. Unlike semantic segmentation, instance segmentation segments only objects of interest and distinguishes between multiple instances of the same class. We can see a strong connection between object detection and instance segmentation here.
Single-Stage Detector
In the previous sections, we introduced methods that produce region proposals, either with a fixed heuristic function or with a neural network; these are called two-stage object detectors. However, notice that the RPN can be modified into a head that directly produces anchor-based bounding box predictions and class predictions. This simpler approach, involving only one stage, is called single-stage object detection.
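As a sketch of that modification, assuming the same 128-channel feature map and 9 anchors as before, a single-stage head simply replaces the binary object/background output with per-class scores:

```python
import torch
import torch.nn as nn

NUM_ANCHORS, NUM_CLASSES = 9, 80  # e.g., COCO's 80 categories

# One conv predicts, per anchor, 4 box adjustments plus a score per class,
# so a single pass over the feature map yields final detections.
head = nn.Conv2d(128, NUM_ANCHORS * (4 + NUM_CLASSES), kernel_size=1)

features = torch.randn(1, 128, 32, 32)
out = head(features)
print(out.shape)  # (1, 756, 32, 32): 9 anchors x (4 + 80) values each
```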
Single-stage detectors tend to be much faster than two-stage detectors because they don't require the extra step of running a head on each region proposal. However, two-stage detectors have an additional bounding box correction mechanism, which historically contributed to higher accuracy. I use the past tense because recent single-stage detectors, incorporating various tricks, now achieve state-of-the-art performance on object detection benchmarks; I will discuss these in future articles. (This is also why we haven't discussed the specifics of heuristic functions, ROI pooling, and ROI align.)
Datasets & Metrics (mAP)
We've been discussing object detection models, but it's also important to address the datasets and metrics used to benchmark them. While many datasets exist for object detection, the Common Objects in Context (COCO) dataset is the largest and most popular, offering high-quality data for benchmarking. To create a high-quality custom dataset for object detection (or instance segmentation), you can use various data annotation tools (FiftyOne, Roboflow, etc.).
The metrics used for object detection are quite complex, as they need to capture the nuances of both bounding box predictions and classifications. First, an Intersection over Union (IoU) threshold must be set: a predicted box only counts as matching a ground-truth box if their IoU exceeds the threshold, since boxes with poor IoUs have no corresponding ground-truth box to compare against. For example, we can evaluate only the model outputs with IoU > 0.5 against ground-truth boxes in the test set, and focus on the model's quality for a specific category, such as "dog".
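For reference, IoU itself is a short computation: the area of the two boxes' intersection divided by the area of their union. A minimal implementation for corner-format boxes:

```python
def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```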
Bounding box predictions come with a confidence value (in the form of an object vs. background classification), and precision and recall change depending on the confidence threshold we set for discarding boxes: a higher threshold yields fewer false positives and more false negatives, i.e., higher precision and lower recall. To account for this variable threshold, we can compute average precision, the integral of precision over recall (the area under the precision-recall curve). (See Persson (2021), linked in the Resources, for the details of the computation.) Computing average precision for every category and taking the mean gives mean average precision (mAP).
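A simplified sketch of the average precision computation, assuming each prediction has already been matched to ground truth at a fixed IoU threshold; real COCO evaluation uses an interpolated precision-recall curve, so the numbers differ slightly:

```python
import numpy as np

def average_precision(confidences, is_true_positive, num_ground_truths):
    """AP for one class: integrate precision over recall."""
    order = np.argsort(confidences)[::-1]        # highest confidence first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    precision = cum_tp / np.arange(1, len(tp) + 1)
    recall = cum_tp / num_ground_truths
    return np.trapz(precision, recall)           # area under the PR curve

# Four predictions, three of them true positives, four ground-truth boxes.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], 4))
```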
mAP is used as the standard metric for object detection because the preferred balance of precision and recall varies by application (for tumor detection, we might prioritize recall and minimize false negatives by setting a low threshold, accepting the cost of false positives, i.e., false alarms that lead to extra medical tests). mAP is typically computed with an IoU threshold of 0.5, referred to as mAP@0.5, but we can use stricter thresholds, such as mAP@0.95. Therefore, we often compute mAPs at different thresholds and average them into a metric like mAP@0.5:0.05:0.95, the average of mAPs for IoU thresholds of 0.5, 0.55, 0.6, ..., 0.95.
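Computing the averaged metric is then just a loop over thresholds; evaluate below is a hypothetical stand-in for a full mAP evaluation at a single IoU threshold:

```python
def evaluate(iou_threshold):
    """Hypothetical stand-in: returns mAP at one IoU threshold
    (a real implementation would rematch predictions and recompute AP)."""
    return max(0.0, 0.6 - 0.5 * (iou_threshold - 0.5))  # fake numbers

thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]  # 0.5 ... 0.95
map_avg = sum(evaluate(t) for t in thresholds) / len(thresholds)
print(map_avg)  # mAP@0.5:0.05:0.95
```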
For instance segmentation, we often report an additional mAP in which the IoU is computed between predicted and ground-truth masks rather than boxes. Usually, an mAP over 0.5 is considered impressive, as Mask R-CNN achieves approximately 0.4 depending on the dataset, and state-of-the-art models reach near 0.6 on some datasets. Unlike simple image classification, where state-of-the-art models achieve 95%+ accuracy across various datasets, we see lower metric values here (though comparing different metrics isn't entirely fair). Whether you interpret this as a ceiling imposed by the inherent complexity of the problem or as room for improvement is up to you.
Conclusion
In this article, we covered what object detection is, several implementations of two-stage object detectors, and the datasets and metrics that highlight the challenges inherent in the task. In the next article, we will delve deeper into modern single-stage object detectors that achieve performance close to the state of the art.
Resources
- COCO. n.d. COCO: Common Objects in Context. cocodataset.org.
- Kai. 2019. Faster R-CNNにおけるRPNの世界一分かりやすい解説 (The World's Clearest Explanation of the RPN in Faster R-CNN). Medium.
- Michigan Online. 2021. Lecture 15: Object Detection. YouTube.
- Persson, A. 2021. Mean Average Precision (mAP) Explained and PyTorch Implementation. YouTube.