# Notes on "[Rich feature hierarchies for accurate object detection and semantic segmentation (RCNN)](https://arxiv.org/pdf/1311.2524.pdf)"

###### tags: `notes` `supervised learning` `object detection` `RCNN`

## Brief Outline

This paper proposes a framework that handles object detection in two steps: the first generates region proposals in order to localize and segment objects, and the second classifies these regions. The method improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%.

## Introduction

---

* Before this paper, visual recognition relied heavily on features extracted with SIFT and HOG.
* Prior work on hierarchical methods for pattern recognition includes the [neocognitron](https://link.springer.com/article/10.1007/BF00344251).
* This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC compared to systems based on simpler HOG-like features.
* CNNs have been used as sliding-window classifiers for at least two decades, typically on constrained object categories such as faces and pedestrians.
* In this paper's network, units in the upper layers have very large receptive fields (195 × 195 pixels) and strides (32 × 32 pixels) in the input image, which makes precise localization within the sliding-window paradigm a challenge.
* The inspiration for this work came from the paper [Recognition using regions](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/glam-cvpr09.pdf).
* At test time, the method generates around 2000 category-independent region proposals for the input image, extracts a fixed-length feature vector from each proposal using a CNN, and then classifies each region with category-specific linear SVMs.
* The figure below summarizes the entire pipeline:

  ![](https://i.imgur.com/i2dnjUz.png)
* Another noteworthy contribution is showing that supervised pre-training on a large auxiliary dataset (ILSVRC), followed by domain-specific fine-tuning on a small dataset (PASCAL), is an effective paradigm for learning high-capacity CNNs when data is scarce.
* Because R-CNN operates on regions, it extends naturally to the task of semantic segmentation.

## Object Detection

---

### Module design

#### Region proposals

* There are numerous papers on generating category-independent region proposals; refer to the main paper for examples.
* While R-CNN is agnostic to the particular region proposal method, the authors use selective search to enable a controlled comparison with prior detection work.

#### Feature extraction

* A 4096-dimensional feature vector is extracted from each region proposal using AlexNet.
* The features are computed by forward propagating a mean-subtracted 227 × 227 RGB image through five convolutional layers and two fully connected layers.
* Regardless of the size or aspect ratio of the candidate region, all pixels in a tight bounding box around it are warped to the required size.
* Prior to warping, the tight bounding box is dilated so that at the warped size there are exactly p pixels of warped image context around the original box (the authors use p = 16).

### Test time detection

* Selective search is run on the test image to extract around 2000 region proposals.
* Each proposal is warped and forward propagated through the CNN to compute its features.
* Then, for each class, each feature vector is scored using the SVM trained for that class.
* Greedy non-maximum suppression (NMS) is applied for each class independently: a region is rejected if its IoU overlap with a higher-scoring selected region exceeds a learned threshold.

#### Run time analysis

* The entire system is efficient because:
  1. The CNN parameters are shared across all categories.
     The result of such sharing is that the time spent computing region proposals and features is amortized over all classes.
  2. The feature vectors are low-dimensional (about 4k dimensions) compared to other methods.

### Training

#### Supervised pre-training

* AlexNet is pre-trained on a large auxiliary dataset (ILSVRC2012 classification) using *image-level annotations only* (i.e. bounding-box labels are not available for this data).

#### Domain-specific fine-tuning

* The pre-trained CNN is adapted to the new task (detection) and the new domain (warped proposal windows) by continuing SGD training on warped region proposals.
* The only architectural change is in the classification layer: the ImageNet-specific 1000-way layer is replaced with a randomly initialized (N + 1)-way layer, where N is the number of object classes and the extra output is for background.
* The authors treat all region proposals with ≥ 0.5 IoU overlap with a ground-truth box as positives for that box’s class and the rest as negatives (refer to the paper for other hyper-parameter details).

#### Object category classifiers

* Each class gets its own binary classifier.
* It is unclear how to label a region that only partially overlaps an object.
* This is resolved with an IoU overlap threshold (0.3 in the paper), below which regions are defined as negatives.
* The positive examples are defined simply to be the ground-truth bounding boxes for each class.
* With these labels, one linear SVM per class is trained as a binary classifier.
* Since the training data is too large to fit in memory, the standard hard negative mining method is adopted.

### Results on PASCAL VOC 2010 and ILSVRC2013

- On VOC 2010, compared to the multi-feature, non-linear kernel SVM approach, R-CNN achieves a large improvement in mAP, from 35.1% to 53.7%, while also being much faster.
- On the ILSVRC2013 detection dataset, R-CNN achieves a mAP of 31.4%, significantly ahead of the second-best result of 24.3% from OverFeat.

### Visualisation

* First-layer filters can be visualized directly; they capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging.
* The authors' idea is to single out a particular unit in the network and use it as if it were an object detector in its own right.
  That is, compute the unit’s activations on a large set of held-out region proposals, sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. This method lets the selected unit “speak for itself” by showing exactly which inputs it fires on.

### Ablation study

* The authors found that fine-tuning the pre-trained model improves performance by a significant margin.
* The boost from fine-tuning is much larger for the fully connected layers (fc6, fc7) than for the last pooling layer (pool5), which suggests that the features learned by the conv. layers during pre-training are general and that most of the improvement comes from learning domain-specific non-linear classifiers on top of them.
* The choice of network architecture has a large effect on R-CNN's detection performance.

### Region proposals

* Selective search was run in “fast mode” on each image in the validation and test sets (but not on images in train).
* Selective search is not scale invariant, so the number of regions produced depends on the image resolution.
* ILSVRC image sizes range from very small to a few that are several megapixels, so the authors resized each image to a fixed width (500 pixels) before running selective search.

### Relationship to OverFeat

- OverFeat can be seen (roughly) as a special case of R-CNN. If one were to replace selective search region proposals with a multi-scale pyramid of regular square regions and change the per-class bounding-box regressors to a single bounding-box regressor, then the systems would be very similar.
- OverFeat is about 9x faster than R-CNN. This speed comes from the fact that OverFeat’s sliding windows (i.e., region proposals) are not warped at the image level, so computation can be easily shared between overlapping windows.

### Semantic Segmentation

- Region classification is a standard technique for semantic segmentation, so R-CNN can be applied to segmentation tasks.
- To facilitate a direct comparison with the then-leading semantic segmentation system, O2P, the authors work within its open-source framework. O2P uses CPMC to generate 150 region proposals per image and then predicts the quality of each region, for each class, using support vector regression (SVR).
- They evaluate three strategies for computing features on CPMC regions, all of which begin by warping the rectangular window around the region to 227 × 227.
  1. The first strategy (*full*) ignores the region’s shape and computes CNN features directly on the warped window, just as in detection. However, these features ignore the non-rectangular shape of the region: two regions might have very similar bounding boxes while having very little overlap.
  2. The second strategy (*fg*) computes CNN features only on a region’s foreground mask. The background is replaced with the mean input so that background regions are zero after mean subtraction.
  3. The third strategy (*full+fg*) simply concatenates the *full* and *fg* features; the authors' experiments validate their complementarity.

### Results on VOC 2011 Segmentation Challenge

- The *fg* strategy slightly outperforms *full*, indicating that the masked region shape provides a stronger signal.
- However, *full+fg* achieves an average accuracy of 47.9%, the authors' best result by a margin of 4.2% (also modestly outperforming O2P), indicating that the context provided by the *full* features is highly informative even given the *fg* features.
- Notably, training the 20 SVRs on the *full+fg* features takes an hour on a single core, compared to 10+ hours for training on O2P features.
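
The three feature strategies from the Semantic Segmentation section (*full*, *fg*, *full+fg*) can be illustrated with a small sketch. This is only a toy under stated assumptions, not the authors' implementation: the window is single-channel, the input mean is a scalar, and `toy_features` is a hypothetical stand-in for the CNN that merely flattens its input. All function and variable names here are made up for illustration.

```python
from typing import Callable, List

Window = List[List[float]]  # assumed: a single-channel, already-warped window
Mask = List[List[bool]]     # True where a pixel belongs to the region (foreground)


def mean_subtract(window: Window, mean: float) -> Window:
    """Subtract the (assumed scalar) input mean from every pixel."""
    return [[pixel - mean for pixel in row] for row in window]


def fg_input(window: Window, mask: Mask, mean: float) -> Window:
    """Replace background pixels with the mean, so they become zero
    once the mean is subtracted."""
    return [
        [pixel if is_fg else mean for pixel, is_fg in zip(row, mask_row)]
        for row, mask_row in zip(window, mask)
    ]


def toy_features(window: Window) -> List[float]:
    """Hypothetical stand-in for the CNN: flattens the window.
    The real system would run a forward pass here instead."""
    return [pixel for row in window for pixel in row]


def region_features(
    window: Window,
    mask: Mask,
    mean: float,
    strategy: str = "full+fg",
    extract: Callable[[Window], List[float]] = toy_features,
) -> List[float]:
    """Compute features for one region using the chosen strategy."""
    full = extract(mean_subtract(window, mean))
    fg = extract(mean_subtract(fg_input(window, mask, mean), mean))
    if strategy == "full":
        return full
    if strategy == "fg":
        return fg
    if strategy == "full+fg":
        return full + fg  # list concatenation of the two feature vectors
    raise ValueError(f"unknown strategy: {strategy}")


window = [[10.0, 20.0], [30.0, 40.0]]
mask = [[True, False], [False, True]]
mean = 25.0

features = region_features(window, mask, mean)
print(features)
# [-15.0, -5.0, 5.0, 15.0, -15.0, 0.0, 0.0, 15.0]
```

In the printed vector, the first four values are the *full* features and the last four are the *fg* features, where the two background pixels have become zero after mean subtraction.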