Region Based Convolutional Neural Networks

Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, specifically for object detection and localization.[1]

The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object and is labeled with that object's category (e.g. car or pedestrian).
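
As a rough illustration of that output format, the sketch below models one detection as a bounding box plus a category. The class name `Detection`, the corner-coordinate layout, the `score` field, and the sample values are all assumptions made for this example and are not taken from the R-CNN papers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected object: a bounding box and its category (hypothetical layout)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float
    category: str
    score: float  # assumed confidence value; not required by the description above


def by_category(detections, category):
    """Return only the detections labeled with the given category."""
    return [d for d in detections if d.category == category]


# Made-up values, purely for illustration.
detections = [
    Detection(12.0, 30.0, 88.0, 74.0, "car", 0.91),
    Detection(95.0, 20.0, 110.0, 70.0, "pedestrian", 0.84),
]
cars = by_category(detections, "car")
```

A detector in this family would return a list of such records for each input image.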

In general, R-CNN architectures pair a region proposal step, originally selective search,[2] with feature maps computed by a CNN.

R-CNN has been extended to perform other computer vision tasks, such as tracking objects from a drone-mounted camera,[3] locating text in an image,[4] and enabling object detection in Google Lens.

Mask R-CNN replaced ROIPooling with a new method called ROIAlign, which samples the feature map at fractional pixel positions instead of rounding region boundaries to whole pixels.
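
The idea of working with fractions of a pixel can be sketched with bilinear interpolation. The code below is a simplified illustration, not the Mask R-CNN implementation: it takes a single sample at the center of each output bin, whereas the published method averages several samples per bin. The function names, the `(top, left, height, width)` region layout, and the convention that integer coordinates fall on feature-map cells are assumptions for this example, and the sample points are assumed to lie inside the feature map.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Interpolate a 2D feature map at the fractional position (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the neighbouring cell so samples on the last row/column stay in bounds.
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature_map[y0][x0] + dx * feature_map[y0][x1]
    bottom = (1 - dx) * feature_map[y1][x0] + dx * feature_map[y1][x1]
    return (1 - dy) * top + dy * bottom


def roi_align(feature_map, roi, out_h=2, out_w=2):
    """Resample a region with fractional bounds to out_h x out_w (one sample per bin)."""
    top, left, height, width = roi  # all four values may be fractional
    bin_h, bin_w = height / out_h, width / out_w
    return [
        [
            bilinear_sample(feature_map,
                            top + (i + 0.5) * bin_h,
                            left + (j + 0.5) * bin_w)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 8x8 feature map whose value at (r, c) is 10*r + c.
feature_map = [[10.0 * r + c for c in range(8)] for r in range(8)]
aligned = roi_align(feature_map, roi=(3.5, 0.25, 4.0, 6.0))
```

Because no coordinate is rounded, a small shift in the region boundary produces a correspondingly small change in the sampled values.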

R-CNN architecture
Fast R-CNN
RoI pooling to size 2x2. In this example, the region proposal (an input parameter) has size 7x5.
Faster R-CNN
Mask R-CNN
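
The RoI pooling example captioned above (a 7x5 region proposal pooled to size 2x2) can be sketched as follows. This is a minimal illustration under stated assumptions: the region is given as `(top, left, height, width)` in whole feature-map cells, bin edges are placed by integer division (one common convention, so bins may differ in size), and the region is at least as large as the output grid. The function name and the sample feature map are hypothetical.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi = (top, left, height, width) to out_h x out_w."""
    top, left, height, width = roi
    pooled = []
    for i in range(out_h):
        # Integer bin edges: bins are uneven when height is not divisible by out_h.
        r0 = top + (i * height) // out_h
        r1 = top + ((i + 1) * height) // out_h
        row = []
        for j in range(out_w):
            c0 = left + (j * width) // out_w
            c1 = left + ((j + 1) * width) // out_w
            # Keep the largest activation inside this bin.
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Hypothetical 8x8 feature map whose value at (r, c) is 10*r + c.
feature_map = [[10 * r + c for c in range(8)] for r in range(8)]
# A region 7 cells wide and 5 cells tall, starting at row 3, column 0.
pooled = roi_max_pool(feature_map, roi=(3, 0, 5, 7))
print(pooled)  # [[42, 46], [72, 76]]
```

Here the 5 rows split into bins of 2 and 3 rows and the 7 columns into bins of 3 and 4 columns, so any region size maps to the same fixed-size output.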